Python Data Science Guide 1.5: Using Sets

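Sets: a set is an unordered collection of unique, hashable elements, so it cannot contain duplicate values. A common use is removing duplicate values from a list.

Operations: sets support intersection, union, difference, and symmetric difference.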
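As a quick illustration of these four operations, here is a minimal sketch. The sets `a` and `b` and their values are made up for the example and do not come from the original text.

```python
# Illustrative values only; `a` and `b` are not from the original text.
a = {1, 2, 3, 4}
b = {3, 4, 5}

common = a & b          # intersection, same as a.intersection(b)
combined = a | b        # union, same as a.union(b)
only_in_a = a - b       # difference, same as a.difference(b)
in_one_only = a ^ b     # symmetric difference, same as a.symmetric_difference(b)

print(common, combined, only_in_a, in_one_only)
```

The operator forms require both operands to be sets, while the method forms (for example `a.union([3, 4, 5])`) accept any iterable.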

# -*- coding: utf-8 -*-
# 1. Initialize two sentences.
st_1 = "dogs chase cats"
st_2 = "dogs hate cats"
# 2. Build a set of words from each string. st_1.split() returns a list;
#    passing it to set() removes duplicates and returns a set object.
st_1_wrds = set(st_1.split())
st_2_wrds = set(st_2.split())
# 3. Count the unique words in each set, i.e. the vocabulary size.
no_wrds_st_1 = len(st_1_wrds)
no_wrds_st_2 = len(st_2_wrds)
# 4. Find the words common to both sets (the result is a set) and count them.
cmn_wrds = st_1_wrds.intersection(st_2_wrds)
no_cmn_wrds = len(cmn_wrds)
# 5. Take the union of the two sets (also a set) and count its elements.
#    The union lists every distinct word once; in natural language
#    processing this is called the vocabulary.
unq_wrds = st_1_wrds.union(st_2_wrds)
no_unq_wrds = len(unq_wrds)
# 6. Compute the Jaccard similarity (true division in Python 3).
similarity = no_cmn_wrds / no_unq_wrds
# 7. Print the results.
print("no words in sent_1 = %d" % no_wrds_st_1)
print("sentence 1 word = ", st_1_wrds)
print("no words in sent_2 =%d" % no_wrds_st_2)
print("sentence 2 words=", st_2_wrds)
print("no words in common =%d" % no_cmn_wrds)
print("common words = ", cmn_wrds)
print("total unique words =%d" % no_unq_wrds)
print("unique words=", unq_wrds)
print("similarity = no words in common/no unique words,%d/%d=%.2f" % (no_cmn_wrds, no_unq_wrds, similarity))

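Output (the order of the elements inside each printed set may differ from run to run):

no words in sent_1 = 3
sentence 1 word =  {'dogs', 'chase', 'cats'}
no words in sent_2 =3
sentence 2 words= {'dogs', 'hate', 'cats'}
no words in common =2
common words =  {'dogs', 'cats'}
total unique words =4
unique words= {'dogs', 'chase', 'hate', 'cats'}
similarity = no words in common/no unique words,2/4=0.50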

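Step 1: The set() function converts a tuple or a list into a set. The conversion discards duplicate elements and returns a set object.

For example:

>>> a = (1, 2, 1)  # tuple -> set
>>> set(a)
{1, 2}
>>> b = [1, 2, 1]  # list -> set
>>> set(b)
{1, 2}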
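Because a set is unordered, converting a list to a set and back does not preserve the original order of the items. If the order matters, one common alternative (not part of the original text) is `dict.fromkeys`, which keeps the first occurrence of each item in insertion order on Python 3.7 and later. The list `words` below is a made-up example.

```python
# Hypothetical input list, used only for illustration.
words = ["dogs", "chase", "cats", "dogs", "cats"]

# set() removes duplicates but gives no guarantee about order.
unique_unordered = set(words)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_ordered = list(dict.fromkeys(words))

print(unique_unordered)
print(unique_ordered)
```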

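Step 2: Use the Jaccard coefficient to measure the similarity between the two sentences. The intersection() and union() methods operate on the two word sets to produce the counts that the coefficient needs.

jaccard = number of words common to both sets / number of unique words in the union of both sets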
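The formula above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `jaccard` is hypothetical, words are split on whitespace as in the example, and returning 0.0 when both sentences are empty is a convention chosen here, not something the original specifies.

```python
def jaccard(sentence_1, sentence_2):
    """Jaccard similarity between the word sets of two sentences."""
    words_1 = set(sentence_1.split())
    words_2 = set(sentence_2.split())
    union = words_1 | words_2
    if not union:
        # Assumed convention: two empty sentences get similarity 0.0
        # so that we never divide by zero.
        return 0.0
    return len(words_1 & words_2) / len(union)


print(jaccard("dogs chase cats", "dogs hate cats"))  # 2 common / 4 unique = 0.5
```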

Libraries such as scikit-learn provide built-in functions for this. Where they are available, prefer them over hand-written set-based helper functions.

Example:

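# -*- coding: utf-8 -*-
# Load the library. jaccard_similarity_score has been removed from recent
# scikit-learn releases; jaccard_score is its replacement and computes the
# Jaccard index of two binary indicator vectors.
from sklearn.metrics import jaccard_score
# 1. Initialize two sentences.
st_1 = "dogs chase cats"
st_2 = "dogs hate cats"
# 2. Build a set of words from each string.
st_1_wrds = set(st_1.split())
st_2_wrds = set(st_2.split())
# 3. Build the vocabulary and encode each sentence as a 0/1 vector over it.
unq_wrds = st_1_wrds.union(st_2_wrds)
a = [1 if w in st_1_wrds else 0 for w in unq_wrds]
b = [1 if w in st_2_wrds else 0 for w in unq_wrds]
print(a)
print(b)
print(jaccard_score(a, b))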

Output:

[1, 0, 1, 1]
[1, 1, 0, 1]
0.5

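Note: in the first two lines of the output, the position of the 0 may change from run to run. The union of the two sets is itself a set, and sets are unordered, so the words can be visited in a different order on each run. Since a 0 marks a word that is missing from a (or b), its position moves with that order. The similarity value is not affected.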
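If reproducible vectors are needed, one option (an addition here, not part of the original example) is to sort the vocabulary before encoding. The variable name `vocab` is introduced only for this sketch.

```python
st_1_wrds = set("dogs chase cats".split())
st_2_wrds = set("dogs hate cats".split())

# Sorting the union gives a list with a fixed, alphabetical order,
# so the 0/1 vectors are identical on every run.
vocab = sorted(st_1_wrds | st_2_wrds)
a = [1 if w in st_1_wrds else 0 for w in vocab]
b = [1 if w in st_2_wrds else 0 for w in vocab]

print(vocab)  # ['cats', 'chase', 'dogs', 'hate']
print(a)      # [1, 1, 1, 0]
print(b)      # [1, 0, 1, 1]
```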
