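Text Vectorization

Text vectorization means converting text into vector form. Two approaches are implemented here: term frequency (tf) and tf-idf. In both cases, the length of each vector equals the size of the dictionary (the vocabulary built from the texts).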
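To make "vector length equals dictionary length" concrete, here is a minimal pure-Python sketch of a tf vector. The helper names (`tokenize`, `build_vocabulary`, `tf_vector`), the toy documents, and the simple tokenization rule are assumptions for illustration only; they are not how any particular library implements it.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters as tokens.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    # The dictionary is the sorted list of distinct tokens across all documents.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def tf_vector(doc, vocabulary):
    # One dimension per dictionary entry; the value is the raw term count.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

docs = ["red apple", "green apple apple"]  # hypothetical toy documents
vocab = build_vocabulary(docs)
vectors = [tf_vector(doc, vocab) for doc in docs]

print(vocab)
print(vectors)
```

Every vector has exactly `len(vocab)` entries, regardless of how short the document is.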

Computing the cosine similarity of two vectors

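import math

def count_cos_similarity(vec_1, vec_2):
    # Vectors of different lengths are not comparable; treat them as dissimilar
    if len(vec_1) != len(vec_2):
        return 0
    # Dot product of the two vectors
    s = sum(vec_1[i] * vec_2[i] for i in range(len(vec_2)))
    # Euclidean norms of the two vectors
    den1 = math.sqrt(sum([pow(number, 2) for number in vec_1]))
    den2 = math.sqrt(sum([pow(number, 2) for number in vec_2]))
    # Guard against division by zero when either vector is all zeros
    if den1 == 0 or den2 == 0:
        return 0
    return s / (den1 * den2)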
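For reference, the quantity this function computes can be written as the standard cosine similarity formula. The symbols $\mathbf{a}$, $\mathbf{b}$ and $n$ are notation introduced here: they stand for `vec_1`, `vec_2` and their common length, with the numerator corresponding to `s` and the two square roots to `den1` and `den2`.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```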

tf text vectors and similarity computation

from sklearn.feature_extraction.text import CountVectorizer

sent1 = "the cat is walking in the bedroom."
sent2 = "the dog was running across the kitchen."

count_vec = CountVectorizer()
sentences = [sent1, sent2]

# Fit once and reuse the dense term-frequency matrix
tf_matrix = count_vec.fit_transform(sentences).toarray()
print(tf_matrix)
# scikit-learn >= 1.0; older releases use get_feature_names() instead
print(count_vec.get_feature_names_out().tolist())

vec_1 = tf_matrix[0]
vec_2 = tf_matrix[1]
print(count_cos_similarity(vec_1, vec_2))

Output:

[[0 1 1 0 1 1 0 0 2 1 0]
 [1 0 0 1 0 0 1 1 2 0 1]]
['across', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']
0.4444444444444444

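Note: the output shows, in order, the vector representation of each text, the word that each dimension corresponds to, and the cosine similarity of the two texts.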
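The similarity value can be checked by hand from the two count vectors in the output. The sketch below uses only the standard library and copies the vectors from the printed result; the variable names are chosen here for illustration.

```python
import math

# Count vectors copied from the output above
vec_1 = [0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0]
vec_2 = [1, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1]

# Only the "the" dimension is non-zero in both vectors: 2 * 2 = 4
dot = sum(a * b for a, b in zip(vec_1, vec_2))

# Each vector has five 1s and one 2, so the squared norm is 5 + 4 = 9
norm_1 = math.sqrt(sum(a * a for a in vec_1))
norm_2 = math.sqrt(sum(b * b for b in vec_2))

similarity = dot / (norm_1 * norm_2)  # 4 / (3 * 3)
print(dot, norm_1, norm_2, similarity)
```

So the whole score of 4/9 comes from the single shared word "the".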

tf-idf text vectors and similarity computation

from sklearn.feature_extraction.text import TfidfVectorizer

sent1 = "the cat is walking in the bedroom."
sent2 = "the dog was running across the kitchen."

tfidf_vec = TfidfVectorizer()
sentences = [sent1, sent2]

# Fit once and reuse the dense tf-idf matrix
tfidf_matrix = tfidf_vec.fit_transform(sentences).toarray()
print(tfidf_matrix)
# scikit-learn >= 1.0; older releases use get_feature_names() instead
print(tfidf_vec.get_feature_names_out().tolist())

vec_1 = tfidf_matrix[0]
vec_2 = tfidf_matrix[1]
print(count_cos_similarity(vec_1, vec_2))

Output:

[[0.         0.37729199 0.37729199 0.         0.37729199 0.37729199
  0.         0.         0.53689271 0.37729199 0.        ]
 [0.37729199 0.         0.         0.37729199 0.         0.
  0.37729199 0.37729199 0.53689271 0.         0.37729199]]
['across', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']
0.28825378403927704

Note: the output has the same layout as above.
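To see where weights such as 0.37729199 and 0.53689271 come from, the first row can be recomputed by hand. This is a sketch under the assumption that the vectorizer runs with scikit-learn's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the function and variable names are made up for illustration.

```python
import math

n_docs = 2

def smoothed_idf(df, n=n_docs):
    # Assumed default: idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

# Term counts of the first sentence and document frequencies, in dictionary order:
# across, bedroom, cat, dog, in, is, kitchen, running, the, walking, was
tf = [0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0]
df = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1]

# Raw tf * idf weights, then L2 normalization of the row
weights = [t * smoothed_idf(d) for t, d in zip(tf, df)]
norm = math.sqrt(sum(w * w for w in weights))
tfidf = [w / norm for w in weights]

# Both rows are unit length and overlap only on "the", so the cosine
# similarity reduces to the square of the "the" weight.
similarity = tfidf[8] ** 2

print([round(w, 8) for w in tfidf])
print(similarity)
```

Because "the" occurs in both sentences, its idf is the lowest possible value (1), which is why tf-idf shrinks the only shared dimension and the similarity drops below the tf result.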

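The example above uses two sentences:

"the cat is walking in the bedroom."

"the dog was running across the kitchen."

Semantically, these two sentences are very similar, yet the computed similarity is low. The underlying reason is that the text vectors produced by both methods only measure how much content (shared words) two texts have in common, and they struggle to capture semantic similarity.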
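A quick way to see this is to list which words the two sentences actually share. The sketch below is plain Python with a simple letter-based tokenizer, which is an assumption made for illustration.

```python
import re

def tokens(text):
    # Assumption: lowercase the text and keep runs of letters as tokens.
    return set(re.findall(r"[a-z]+", text.lower()))

sent1 = "the cat is walking in the bedroom."
sent2 = "the dog was running across the kitchen."

shared = tokens(sent1) & tokens(sent2)
only_in_1 = tokens(sent1) - tokens(sent2)
only_in_2 = tokens(sent2) - tokens(sent1)

print(sorted(shared))
print(sorted(only_in_1))
print(sorted(only_in_2))
```

Related pairs such as "cat" and "dog" or "bedroom" and "kitchen" occupy separate dimensions, so they contribute nothing to the dot product; only the exact match on "the" does.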

Source: 宇毅
