Cluster Analysis - Text Similarity (Jaccard, Cosine)

New Collar Level 2

Aaron P 2024. 6. 21. 17:40

Example 1) Compute the Jaccard similarity for each pair of the sentences below.

Conditions)

- Sentences: 'The sky is blue', 'The sun is bright', 'The sun in the sky is bright'

- Package: nltk

- Functions: word_tokenize, and WordNetLemmatizer from the nltk.stem module

Result)

The sky is blue & The sun is bright : 0.3333333333333333
The sky is blue & The sun in the sky is bright : 0.42857142857142855
The sun is bright & The sun in the sky is bright : 0.6666666666666666
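The first value can be checked by hand. A worked sketch, assuming each sentence reduces to its set of distinct lowercased tokens (the symbols A and B are introduced here only for illustration):

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
A = \{\text{the}, \text{sky}, \text{is}, \text{blue}\}, \quad
B = \{\text{the}, \text{sun}, \text{is}, \text{bright}\}

A \cap B = \{\text{the}, \text{is}\}, \quad
A \cup B = \{\text{the}, \text{sky}, \text{is}, \text{blue}, \text{sun}, \text{bright}\}
\;\Rightarrow\; J(A, B) = \frac{2}{6} \approx 0.333
```

The other two pairs work the same way: 3 shared tokens out of 7 distinct ones gives about 0.429, and 4 out of 6 gives about 0.667.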

Code)

# Input data
d1 = 'The sky is blue'
d2 = 'The sun is bright'
d3 = 'The sun in the sky is bright'

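# Jaccard similarity
# Note: word_tokenize and WordNetLemmatizer need NLTK data, e.g. nltk.download('punkt') and nltk.download('wordnet');
# newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Define the Jaccard similarity function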

def jaccard_similarity(d1, d2):
    lemmatizer = WordNetLemmatizer()
    words1 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(d1)]
    words2 = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(d2)]
    inter = len(set(words1).intersection(set(words2)))
    union = len(set(words1).union(set(words2)))
    return inter/union

# Print the Jaccard similarity: 1 when the two sets are identical, 0 when they share no elements
print(f'{d1} & {d2} : {jaccard_similarity(d1, d2)}')
print(f'{d1} & {d3} : {jaccard_similarity(d1, d3)}')
print(f'{d2} & {d3} : {jaccard_similarity(d2, d3)}')
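With more than three sentences, writing one print call per pair gets tedious. The following is a minimal sketch of looping over every pair with itertools.combinations. The helpers token_set and jaccard are hypothetical names, and a plain lowercase whitespace split stands in for word_tokenize plus WordNetLemmatizer, which is an assumption that only holds for simple sentences like these (no punctuation, no inflected forms).

```python
from itertools import combinations


def token_set(sentence):
    # Assumption: lowercase + whitespace split replaces the nltk
    # tokenizer and lemmatizer; adequate for these three sentences only.
    return set(sentence.lower().split())


def jaccard(a, b):
    sa, sb = token_set(a), token_set(b)
    union = sa | sb
    # Treat two empty sentences as having similarity 0 to avoid dividing by zero.
    return len(sa & sb) / len(union) if union else 0.0


sentences = [
    'The sky is blue',
    'The sun is bright',
    'The sun in the sky is bright',
]

# Every unordered pair, in the order combinations() yields them.
pairwise = {(a, b): jaccard(a, b) for a, b in combinations(sentences, 2)}

for (a, b), score in pairwise.items():
    print(f'{a} & {b} : {score:.4f}')
```

Keeping the scores in a dictionary keyed by the sentence pair makes it easy to look up or sort the results afterwards.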

Example 2) Compute the cosine similarity for each pair of the sentences below.

Conditions)

- Sentences: 'The sky is blue', 'The sun is bright', 'The sun in the sky is bright'

- Packages: sklearn, numpy, pandas
- Functions: TfidfVectorizer, and cosine_similarity from the sklearn.metrics.pairwise module

Result)

Cosine Similarity calculation
   The sky is blue  The sun is bright  The sun in the sky is bright
0         1.000000           0.339579                      0.508890
1         0.339579           1.000000                      0.764461
2         0.508890           0.764461                      1.000000

Code)

# Input data
d1 = 'The sky is blue'
d2 = 'The sun is bright'
d3 = 'The sun in the sky is bright'

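# Compute cosine similarity

# The value is 1 when two vectors point in exactly the same direction, 0 when they are at 90 degrees, and -1 when they point in opposite directions
# Cosine similarity ranges from -1 to 1; the closer the value is to 1, the more similar the vectors
# (TF-IDF weights are non-negative, so the values here fall between 0 and 1)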


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 
import numpy as np
import pandas as pd

# TF-IDF calculation: build a NumPy array of the documents first
docs = np.array([d1,d2,d3])

tfidf_matrix = TfidfVectorizer().fit_transform(docs)
similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)

df = pd.DataFrame(similarity, columns = [d1,d2,d3])
df
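To see where the numbers in the table come from, the same computation can be done by hand. This is a rough sketch in plain Python. It assumes TfidfVectorizer runs with its default settings as I understand them: lowercased text, tokens of two or more word characters, raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization. The names tokenize, tfidf_vector and cosine are hypothetical helpers, not part of sklearn.

```python
import math
import re
from collections import Counter

docs = [
    'The sky is blue',
    'The sun is bright',
    'The sun in the sky is bright',
]


def tokenize(text):
    # Assumed default of TfidfVectorizer: lowercase, tokens of 2+ word characters.
    return re.findall(r'\b\w\w+\b', text.lower())


counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
n_docs = len(docs)

# Smoothed inverse document frequency for each term (assumed sklearn default).
idf = {
    t: math.log((1 + n_docs) / (1 + sum(t in c for c in counts))) + 1
    for t in vocab
}


def tfidf_vector(term_counts):
    # Raw count times IDF, then scaled to unit length (L2 norm).
    raw = [term_counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


vectors = [tfidf_vector(c) for c in counts]
similarity = [[cosine(u, v) for v in vectors] for u in vectors]

for row in similarity:
    print('  '.join(f'{x:.6f}' for x in row))
```

Under those assumptions the off-diagonal values should agree with the sklearn result above to about six decimal places. Because the vectors are already unit length, the division inside cosine is redundant here, but it keeps the function correct for unnormalized input.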