Preparing the data: building word vectors from text
```python
# -*- coding: utf-8 -*-
from numpy import array, ones, log

def loaddataset():
    postinglist = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classvec = [0, 1, 0, 1, 0, 1]  # 1 = abusive post, 0 = normal post
    return postinglist, classvec

def createvocablist(dataset):
    # The vocabulary is the union of all words appearing in the data set.
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)
    return list(vocabset)

def wordstovec(vocablist, inputset):
    # Set-of-words model: 1 if the vocabulary word occurs in the document.
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary" % word)
    return returnvec
```
This block maps a document onto the vocabulary to build a vector containing only 0s and 1s: if a vocabulary word appears in the document, the position of that word in the vector is set to 1, otherwise it stays 0. This makes the probability computations that follow straightforward.
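To see the set-of-words conversion concretely, here is a minimal sketch on a made-up two-document corpus (the names `docs`, `vocab`, and `vector` are illustrative, not from the listing above):

```python
# Tiny illustrative example: build a vocabulary, then vectorize one document.
docs = [['my', 'dog', 'is', 'cute'], ['stop', 'posting', 'garbage']]

# Vocabulary = sorted union of all words, sorted only for a stable word order.
vocab = sorted(set(w for doc in docs for w in doc))

# Set-of-words vector for one document.
doc = ['my', 'cute', 'dog']
vector = [1 if w in doc else 0 for w in vocab]

print(vocab)    # ['cute', 'dog', 'garbage', 'is', 'my', 'posting', 'stop']
print(vector)   # [1, 1, 0, 0, 1, 0, 0]
```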
Training the algorithm: computing probabilities from word vectors

```python
def trainnb(trainmatrix, traincategory):
    numtraindocs = len(trainmatrix)
    numwords = len(trainmatrix[0])
    pabusive = sum(traincategory) / float(numtraindocs)
    # The "natural" initialization would be p0num = zeros(numwords) and
    # p0denom = 0.0, but then a single zero-probability word would drive
    # the whole product to 0. Initializing the counts to 1 and the
    # denominators to 2.0 (Laplace smoothing) removes that effect.
    p0num = ones(numwords); p1num = ones(numwords)
    p0denom = 2.0; p1denom = 2.0
    for i in range(numtraindocs):
        if traincategory[i] == 1:
            p1num += trainmatrix[i]
            p1denom += sum(trainmatrix[i])
        else:
            p0num += trainmatrix[i]
            p0denom += sum(trainmatrix[i])
    # The raw probabilities are p1num/p1denom and p0num/p0denom; taking
    # logs prevents the answer being lost to underflow when many small
    # factors are multiplied.
    p1vect = log(p1num / p1denom)
    p0vect = log(p0num / p0denom)
    return p0vect, p1vect, pabusive
```
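Why the log transform matters: multiplying many small per-word probabilities underflows double precision to 0.0, while the equivalent sum of logs stays well within range. A quick stdlib-only demonstration (the probabilities here are made up for illustration):

```python
import math

# 80 words, each with a (made-up) conditional probability of 1e-5.
probs = [1e-5] * 80

product = 1.0
for p in probs:
    product *= p          # 1e-400 is below the smallest double: underflows

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0  (the product has underflowed)
print(log_sum)   # about -921.03, still perfectly representable
```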
Implementing the classifier
```python
def classifynb(vecclassify, p0vec, p1vec, pclass1):
    # Compare log p(c) + sum of log p(w|c) for the two classes; sums of
    # logs replace products of probabilities.
    p1 = sum(vecclassify * p1vec) + log(pclass1)
    p0 = sum(vecclassify * p0vec) + log(1.0 - pclass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingnb():
    listposts, listclasses = loaddataset()
    myvocablist = createvocablist(listposts)
    trainmat = []
    for postindoc in listposts:
        trainmat.append(wordstovec(myvocablist, postindoc))
    p0v, p1v, pab = trainnb(trainmat, listclasses)
    testentry = ['love', 'my', 'dalmation']
    thisdoc = array(wordstovec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))
    testentry = ['stupid', 'garbage']
    thisdoc = array(wordstovec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))
```
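As an end-to-end, dependency-free sketch, the same pipeline (set-of-words vectors, Laplace-smoothed counts, log-space comparison of the two classes) can be condensed as follows; the mini corpus and the names `train`/`classify` are illustrative, not part of the original listing:

```python
import math

# Toy corpus: (tokens, label); 1 = abusive, 0 = normal (made-up mini data set).
docs = [(['my', 'dog', 'is', 'cute'], 0),
        (['love', 'my', 'dog'], 0),
        (['stupid', 'worthless', 'dog'], 1),
        (['stop', 'posting', 'stupid', 'garbage'], 1)]

vocab = sorted({w for toks, _ in docs for w in toks})

def to_vec(toks):
    # Set-of-words: 1 if the vocabulary word occurs in the document.
    return [1 if w in toks else 0 for w in vocab]

def train(docs):
    prior1 = sum(label for _, label in docs) / len(docs)
    # Laplace smoothing: counts start at 1, denominators at 2.0.
    num = {0: [1] * len(vocab), 1: [1] * len(vocab)}
    den = {0: 2.0, 1: 2.0}
    for toks, label in docs:
        v = to_vec(toks)
        num[label] = [a + b for a, b in zip(num[label], v)]
        den[label] += sum(v)
    logp = {c: [math.log(n / den[c]) for n in num[c]] for c in (0, 1)}
    return logp, prior1

def classify(toks, logp, prior1):
    v = to_vec(toks)
    s1 = sum(x * l for x, l in zip(v, logp[1])) + math.log(prior1)
    s0 = sum(x * l for x, l in zip(v, logp[0])) + math.log(1 - prior1)
    return 1 if s1 > s0 else 0

logp, prior1 = train(docs)
print(classify(['stupid', 'garbage'], logp, prior1))   # -> 1 (abusive)
print(classify(['love', 'my', 'dog'], logp, prior1))   # -> 0 (normal)
```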