Text Vectorization (Word Embeddings)

Aaron P 2024. 6. 21. 17:12

Example 1-1) Train on the sentences and extract the 5 words most closely related to '리스크' (risk)

Conditions)

- Data location: data_set\\risk_noun.txt

- Functions used: Word2Vec, and the KeyedVectors class from the gensim.models module

- Word2Vec training settings: context window size: 2, minimum word frequency: 1, training speed: workers = 4, CBOW model

Answer)

[('위험', 0.9860552549362183), ('관리', 0.9852122068405151), ('기업', 0.9813206791877747), ('위험관리', 0.9809971451759338), ('평가', 0.976618230342865)]

Code)

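import ast

# Read the whole file as a single string; the with block closes the file afterwards
with open('data_set\\risk_noun.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Convert the string into a list. ast.literal_eval accepts only Python literals,
# so it is safer than eval for text read from a file.
risk_noun = ast.literal_eval(raw)
type(risk_noun)

# print(risk_noun[:5])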

from gensim.models import Word2Vec

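model = Word2Vec(risk_noun, window=2, min_count=1, workers=4, sg=0)
# min_count: words that occur fewer times than this are ignored.
# window: the maximum number of words on each side of the target word that are treated as its context.
# Skip-gram produces more training examples than CBOW from a corpus of the same size, which often gives better embeddings,
#   so it is frequently preferred over CBOW. This exercise uses CBOW (sg=0) as the conditions specify.
# workers: the number of threads used for training; more threads generally means faster training on a multi-core machine.
# sg: selects the training algorithm. sg=0 (the default) uses CBOW; sg=1 uses skip-gram.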

model.wv.save_word2vec_format('word2')  # Save the trained word vectors in word2vec format (vectors only, not the full model)

value = model.wv.most_similar('리스크', topn=5)  # topn=5: the 5 words most closely related to '리스크' (risk)
print(value)
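
The window and sg settings are easier to see with a small illustration. The following is a minimal sketch, not gensim's actual training loop: it assumes a fixed symmetric window (gensim may vary the effective window during training), and the function name context_windows and the sample sentence are hypothetical, not taken from risk_noun.txt. It shows that CBOW forms one example per target word from all of its context words, while skip-gram forms one example per (target, context) pair, which is why skip-gram yields more training examples from the same text.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) using a fixed symmetric window (illustrative only)."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield target, left + right

# Hypothetical tokenized sentence, used only for illustration
sentence = ['risk', 'management', 'company', 'assessment', 'process']

# CBOW-style: one example per target, predicting the target from its whole context
cbow_examples = list(context_windows(sentence, window=2))

# Skip-gram-style: one example per (target, single context word) pair
skipgram_examples = [(target, ctx) for target, context in cbow_examples for ctx in context]

for target, context in cbow_examples:
    print(target, '<-', context)
print(len(cbow_examples), 'CBOW examples,', len(skipgram_examples), 'skip-gram pairs')
```

With window=2, a word in the middle of a sentence gets up to four context words, while words near the edges get fewer.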

Example 1-2) Using the word2 model trained above, retrieve the 5 words most related to '마케팅' (marketing)

Result)

[('이해', 0.5496183633804321),
 ('의미', 0.5442449450492859),
 ('활동', 0.5360015630722046),
 ('공급', 0.5355146527290344),
 ('개별', 0.5308396220207214)]

Code)

from gensim.models import KeyedVectors

load_model = KeyedVectors.load_word2vec_format('word2')  # Load the saved word vectors
load_model.most_similar('마케팅', topn=5)  # Extract the 5 words most similar to '마케팅' (marketing)
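
The scores returned by most_similar are cosine similarities between word vectors, so values close to 1 mean the two vectors point in nearly the same direction. The sketch below reproduces that ranking idea in plain Python under stated assumptions: the two-dimensional toy_vectors and the helper names are hypothetical, and real embeddings have many more dimensions. It is meant to illustrate the calculation, not to replace gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=5):
    """Rank every other word by cosine similarity to `word` (illustrative helper)."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration
toy_vectors = {
    'risk': [1.0, 0.0],
    'danger': [0.9, 0.1],
    'marketing': [0.0, 1.0],
    'supply': [0.5, 0.5],
}

ranked = most_similar(toy_vectors, 'risk', topn=2)
print(ranked)
```

Here 'danger' ranks first because its vector is almost parallel to that of 'risk', while 'marketing' is orthogonal to it and scores 0.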