230515 / BSA11. CountVectorizer, TfidfVectorizer

BSA07_Python_SMS-Spam.ipynb

1. CountVectorizer

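from sklearn.feature_extraction.text import CountVectorizer

# Example sentences whose word frequencies we will count
예제문장 = ['This is the first document.', 'This document is the second document.',
         'And this is the third one.', 'Is this the first document?']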

vectorizer = CountVectorizer()
토큰개수 = vectorizer.fit_transform(예제문장)
vectorizer.get_feature_names_out()
# Output
# array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], dtype=object)

print(vectorizer.vocabulary_)
# Output
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# 'this': 8 means 'this' is stored in column index 8 (0-based, so the ninth and last column)

print(토큰개수.toarray())
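# Each row is a sentence and each column is a word from the vocabulary
# 'and' (column 0) is absent from sentences 1, 2, and 4 and appears once in sentence 3
# 'document' (column 1) appears once in sentence 1 and twice in sentence 2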
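
To make the counting step concrete, here is a minimal pure-Python sketch of what the vectorizer computes. It assumes CountVectorizer's default settings (lowercasing, tokens of two or more word characters); the names `sentences`, `tokenize`, `vocab`, and `matrix` are illustrative and do not come from the notebook.

```python
import re
from collections import Counter

sentences = ['This is the first document.', 'This document is the second document.',
             'And this is the third one.', 'Is this the first document?']

def tokenize(text):
    # Assumption: mimics the default tokenizer (lowercase, tokens of 2+ word characters)
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: unique tokens in alphabetical order, mapped to column indices
all_tokens = {token for s in sentences for token in tokenize(s)}
vocab = {word: index for index, word in enumerate(sorted(all_tokens))}

# One row per sentence, one column per vocabulary word
matrix = []
for s in sentences:
    counts = Counter(tokenize(s))
    matrix.append([counts[word] for word in vocab])  # dict keeps the sorted order

print(vocab)
for row in matrix:
    print(row)
```

The second row holds a 2 in column 1, which is the repeated 'document' in sentence 2.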

2. CountVectorizer with n-grams

vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
토큰개수2 =  vectorizer2.fit_transform(예제문장)
vectorizer2.get_feature_names_out()
# Output
# array(['and this', 'document is', 'first document', 'is the', 'is this',
#        'second document', 'the first', 'the second', 'the third', 'third one',
#        'this document', 'this is', 'this the'], dtype=object)
# ngram_range=(2, 2) keeps only bigrams; the features are listed in alphabetical order

print(vectorizer2.vocabulary_)
# Output
# {'this is': 11, 'is the': 3, 'the first': 6, 'first document': 2, 'this document': 10, 'document is': 1, 'the second': 7, 'second document': 5, 'and this': 0, 'the third': 8, 'third one': 9, 'is this': 4, 'this the': 12}

print(토큰개수2.toarray())  
# 'this is' (column 11) appears in sentences 1 and 3 and is absent from sentences 2 and 4
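
The bigram features can be checked by hand. The following is a small sketch under the same tokenization assumption as the default analyzer (lowercase, tokens of two or more word characters); `sentences`, `bigrams`, and `containing` are hypothetical helper names.

```python
import re

sentences = ['This is the first document.', 'This document is the second document.',
             'And this is the third one.', 'Is this the first document?']

def bigrams(text):
    # Tokenize, then join each pair of adjacent tokens with a space
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

for number, s in enumerate(sentences, start=1):
    print(number, bigrams(s))

# Sentence numbers (1-based) that contain the bigram 'this is'
containing = [n for n, s in enumerate(sentences, start=1) if "this is" in bigrams(s)]
print(containing)
```

Sentence 4 starts with "Is this", which yields the bigram 'is this' rather than 'this is', so word order matters once n-grams are used.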

3. CountVectorizer restricted to a fixed vocabulary

# Count only these words; anything else (for example 'third') is ignored
단어장 = ['this', 'document', 'first', 'is', 'second', 'the', 'and', 'one']
vectorizer3 = CountVectorizer(vocabulary=단어장)
토큰개수3 = vectorizer3.fit_transform(예제문장)
vectorizer3.get_feature_names_out()
# Output (columns follow the order of 단어장, not alphabetical order)
# array(['this', 'document', 'first', 'is', 'second', 'the', 'and', 'one'], dtype=object)
      
print(vectorizer3.vocabulary_)
# Output
# {'this': 0, 'document': 1, 'first': 2, 'is': 3, 'second': 4, 'the': 5, 'and': 6, 'one': 7}

print(토큰개수3.toarray())
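
As a quick check of what a fixed vocabulary changes, the sketch below rebuilds the same matrix and compares row totals. The variable names (`sentences`, `vocab`, `fixed`, `counts`) are illustrative; the scikit-learn calls are the same ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is the first document.', 'This document is the second document.',
             'And this is the third one.', 'Is this the first document?']
vocab = ['this', 'document', 'first', 'is', 'second', 'the', 'and', 'one']

fixed = CountVectorizer(vocabulary=vocab)
counts = fixed.fit_transform(sentences).toarray()
print(counts)

# Sentence 3 has six tokens, but 'third' is not in vocab, so only five are counted
print(counts[2].sum())
```

Because the column order now follows the supplied list, 'this' moves to column 0 instead of column 8.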

4. Measuring sentence similarity with TfidfVectorizer

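from sklearn.feature_extraction.text import TfidfVectorizer

예제문장 = ['This is the first document.', 'This document is the second document.',
         'And this is the third one.', 'Is this the first document?']

# stop_words='english' drops common English words before the vocabulary is built
tfidf = TfidfVectorizer(stop_words='english').fit(예제문장)
print(tfidf.vocabulary_)
# Output (every other word in the sentences is on the stop-word list)
# {'document': 0, 'second': 1}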

print(tfidf.transform(예제문장).toarray())  
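# Comparing these TF-IDF vectors tells us how similar the sentences are
# Sentences 2 and 3 have different vectors from sentences 1 and 4, so they read as different data
# Sentences 1 and 4 get the same vector here, so they read as essentially the same sentence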
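
One way to turn these vectors into an explicit similarity score is cosine similarity. The notes do not name a specific measure, so this is only a sketch under that assumption; `sentences`, `weights`, and `similarity` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ['This is the first document.', 'This document is the second document.',
             'And this is the third one.', 'Is this the first document?']

weights = TfidfVectorizer(stop_words='english').fit_transform(sentences)

# similarity[i, j] is the cosine similarity between sentence i+1 and sentence j+1
similarity = cosine_similarity(weights)
print(similarity.round(3))
```

Sentences 1 and 4 score 1.0, and sentence 2 scores about 0.79 against both. Sentence 3 scores 0 against everything, itself included, because all of its words are stop words and its vector is empty; a score of 0 here means "nothing left to compare", not "measured as different".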

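from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# Pipeline: chains several processing steps into a single estimator
파이프 = Pipeline([('count', CountVectorizer(vocabulary=단어장)),
                 ('tfid', TfidfTransformer())]).fit(예제문장)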
                 
# Running only the first step of the pipeline
# 파이프['count'] is the fitted CountVectorizer(vocabulary=단어장)
토큰개수  = 파이프['count'].transform(예제문장)

print(토큰개수.toarray())

파이프['tfid'].idf_
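
The `idf_` values can be reproduced by hand. The sketch below assumes TfidfTransformer's default `smooth_idf=True`, for which scikit-learn documents idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing t. The names `sentences`, `vocab`, `token_sets`, and `idf` are illustrative.

```python
import math
import re

sentences = ['This is the first document.', 'This document is the second document.',
             'And this is the third one.', 'Is this the first document?']
vocab = ['this', 'document', 'first', 'is', 'second', 'the', 'and', 'one']

n_docs = len(sentences)
# Set of distinct tokens per sentence (lowercase, tokens of 2+ word characters)
token_sets = [set(re.findall(r"\b\w\w+\b", s.lower())) for s in sentences]

idf = []
for word in vocab:
    df = sum(word in tokens for tokens in token_sets)  # document frequency
    idf.append(math.log((1 + n_docs) / (1 + df)) + 1)  # smoothed idf

for word, value in zip(vocab, idf):
    print(f"{word:10s} {value:.4f}")
```

Words that occur in every sentence ('this', 'is', 'the') get the minimum weight of 1.0, while words that occur in a single sentence ('second', 'and', 'one') get the largest weight.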

# Converts the raw strings into numeric features
파이프.transform(예제문장).toarray()  # these values serve as the inputs (X) for predicting y

Reference

https://daily-life-in-20s.tistory.com/352
