Removing Stop Words from Unstructured Data

Aaron P 2024. 6. 18. 21:07

Example 1) Remove the stop words (stop_words) from the sentence below

Conditions)

- Stop word list: on, in, the

- Sentence: 'singer on the stage'

Answer)

['singer', 'stage']

Code)

stop_words = 'on in the'

stop_words = stop_words.split()  # split on whitespace

sentence = 'singer on the stage'
sentence = sentence.split()

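nouns = []  # tokens that survive the filter (not necessarily nouns)

# Equivalent index-based version:
# for i in range(0,len(sentence)):
#     if sentence[i] not in stop_words:
#         nouns.append(sentence[i])

# Keep every token that is not in the stop word list
for noun in sentence:
    if noun not in stop_words:
        nouns.append(noun)

nouns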
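The same filter can also be written as a list comprehension. This is a minimal sketch rather than the post's original code: the helper name `remove_stop_words` is hypothetical, and converting the stop word list to a set is an added assumption that only matters for speed when the list is long.

```python
def remove_stop_words(sentence, stop_words):
    """Return the whitespace-separated tokens of `sentence` that are not stop words."""
    stop_set = set(stop_words)  # assumption: a set makes membership checks faster than a list
    return [token for token in sentence.split() if token not in stop_set]


result = remove_stop_words('singer on the stage', 'on in the'.split())
print(result)  # ['singer', 'stage']
```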

Example 2) Use the stop word list from the nltk package

Conditions)

- Example sentence: If you do not walk today, you will have to run tomorrow

- Functions used: stopwords from the nltk.corpus module, word_tokenize from the nltk.tokenize module

Result)

['If', 'walk', 'today', ',', 'run', 'tomorrow']

Code)

# Assumes the NLTK stop word corpus and tokenizer data were already fetched with nltk.download()
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = stopwords.words('english')

# Tokenize

s = 'If you do not walk today, you will have to run tomorrow'

words = word_tokenize(s)

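nouns = []

# The comparison is case-sensitive and the list holds lowercase words,
# so 'If' is kept even though 'if' is a stop word; ',' is kept as well
for noun in words:
    if noun not in stop_words:
        nouns.append(noun)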

nouns
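One way to also drop the capitalized 'If' and the comma from the result above is to lowercase each token before comparing and to skip punctuation-only tokens. The sketch below is only an illustration: to run without any NLTK data it uses a small hand-written stop word set and a pre-tokenized list as stand-ins, and the names `SAMPLE_STOP_WORDS` and `filter_tokens` are hypothetical, not part of NLTK.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); only the words needed here
SAMPLE_STOP_WORDS = {'if', 'you', 'do', 'not', 'will', 'have', 'to'}


def filter_tokens(tokens, stop_words):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if all(ch in string.punctuation for ch in token):
            continue  # token consists only of punctuation, e.g. ','
        kept.append(token)
    return kept


# Stand-in for the output of word_tokenize on the example sentence
tokens = ['If', 'you', 'do', 'not', 'walk', 'today', ',',
          'you', 'will', 'have', 'to', 'run', 'tomorrow']
filtered = filter_tokens(tokens, SAMPLE_STOP_WORDS)
print(filtered)  # ['walk', 'today', 'run', 'tomorrow']
```

Whether to lowercase or strip punctuation depends on the task; the original result keeps both because neither step is applied.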

Example 3) Remove stop words from the Korean sentence below

Conditions)

- Sentence: 걷는 놈 위에 뛰는 놈 있고, 뛰는 놈 위에 나는 놈 있다. 나는 놈 위에는 즐기는 놈이 있다.

- Function used: Stopwords from the kiwipiepy.utils module

Result)

['걷',
 '놈',
 '위',
 '뛰',
 '놈',
 '뛰',
 '놈',
 '위',
 '날',
 '놈',
 '있',
 '나',
 '놈',
 '위',
 '즐기',
 '놈']

Code)

from kiwipiepy.utils import Stopwords
from kiwipiepy import Kiwi

stopwords = Stopwords()
kiwi = Kiwi()

text = '걷는 놈 위에 뛰는 놈 있고, 뛰는 놈 위에 나는 놈 있다. 나는 놈 위에는 즐기는 놈이 있다.'

token_text = kiwi.tokenize(text, stopwords=stopwords)

# Collect the surface form of each token that remains after stop word removal
token_kr = []
for t in token_text:
    token_kr.append(t.form)

token_kr
