Naive Bayes
For example, suppose we want to decide whether an email is spam. What we can observe directly is the distribution of words in the email; if we also know how often those words occur in spam, Bayes' theorem lets us compute the probability that the email is spam.
One assumption of the naive Bayes classifier is that every feature is equally important (alongside the core "naive" assumption that features are conditionally independent given the class).
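To make the Bayes'-theorem step concrete, here is a minimal sketch with hypothetical numbers (the priors and likelihoods below are made up for illustration, not taken from any real corpus):

```python
# Hypothetical probabilities, for illustration only
p_spam = 0.3               # prior: P(spam)
p_word_given_spam = 0.8    # likelihood: P("viagra" | spam)
p_word_given_ham = 0.05    # likelihood: P("viagra" | not spam)

# Law of total probability: P(word) over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))  # -> 0.873
```

Even though the prior probability of spam is only 0.3, seeing the word pushes the posterior close to 0.9, which is exactly the update the classifier below performs over every word at once.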
loaddataset()
createvocablist(dataset)
setofwords2vec(vocablist, inputset)
bagofwords2vecmn(vocablist, inputset)
trainnb0(trainmatrix,traincatergory)
classifynb(vec2classify, p0vec, p1vec, pclass1)
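Two of the helpers listed above differ only in how they vectorize a document: setofwords2vec records whether each vocabulary word appears, while bagofwords2vecmn counts occurrences. A standalone illustration with a tiny hypothetical vocabulary:

```python
# Hypothetical vocabulary and document, for illustration only
vocab = ['dog', 'my', 'stupid']
doc = ['my', 'dog', 'my', 'dog']

# Set-of-words: presence only (0 or 1 per vocabulary word)
set_vec = [1 if w in doc else 0 for w in vocab]
# Bag-of-words: occurrence counts per vocabulary word
bag_vec = [doc.count(w) for w in vocab]

print(set_vec)  # -> [1, 1, 0]
print(bag_vec)  # -> [2, 2, 0]
```

The bag-of-words variant keeps repeated-word information, which often matters for longer documents.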
# coding=utf-8
from numpy import *

def loaddataset():
    postinglist = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classvec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postinglist, classvec

# Build a list of all unique words in the dataset
def createvocablist(dataset):
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)
    return list(vocabset)

# Set-of-words model: record only whether each vocabulary word appears
def setofwords2vec(vocablist, inputset):
    retvocablist = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            retvocablist[vocablist.index(word)] = 1
        else:
            print('word', word, 'not in dict')
    return retvocablist

# Alternative model (bag of words): count occurrences of each word
def bagofwords2vecmn(vocablist, inputset):
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] += 1
    return returnvec

def trainnb0(trainmatrix, traincatergory):
    numtraindoc = len(trainmatrix)
    numwords = len(trainmatrix[0])
    pabusive = sum(traincatergory) / float(numtraindoc)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing),
    # so no single zero probability can wipe out the whole product
    p0num = ones(numwords)
    p1num = ones(numwords)
    p0denom = 2.0
    p1denom = 2.0
    for i in range(numtraindoc):
        if traincatergory[i] == 1:
            p1num += trainmatrix[i]
            p1denom += sum(trainmatrix[i])
        else:
            p0num += trainmatrix[i]
            p0denom += sum(trainmatrix[i])
    # Take logs for numerical precision; otherwise the product of many
    # small probabilities would likely underflow to zero
    p1vect = log(p1num / p1denom)
    p0vect = log(p0num / p0denom)
    return p0vect, p1vect, pabusive

def classifynb(vec2classify, p0vec, p1vec, pclass1):
    p1 = sum(vec2classify * p1vec) + log(pclass1)  # element-wise mult
    p0 = sum(vec2classify * p0vec) + log(1.0 - pclass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingnb():
    listoposts, listclasses = loaddataset()
    myvocablist = createvocablist(listoposts)
    trainmat = []
    for postindoc in listoposts:
        trainmat.append(setofwords2vec(myvocablist, postindoc))
    p0v, p1v, pab = trainnb0(array(trainmat), array(listclasses))
    testentry = ['love', 'my', 'dalmation']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))
    testentry = ['stupid', 'garbage']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))

def main():
    testingnb()

if __name__ == '__main__':
    main()