Implementing the ID3 Algorithm in Python

# coding: utf-8
from math import log
import operator

"""Sample data:
No.   Breathes with gills   Has flippers   Is a fish
1     yes                   yes            yes
2     yes                   no             yes
3     yes                   no             no
4     no                    yes            no
5     no                    yes            no
"""


def createdataset():
    # Each row is [gill, flippers, class label]; 1 = yes, 0 = no.
    dataset = [
        [1, 1, 'yes'],
        [1, 0, 'yes'],
        [1, 0, 'no'],
        [0, 1, 'no'],
        [0, 1, 'no']
    ]
    labels = ['gill', 'flippers']
    return dataset, labels

def calshannonent(dataset):
    # Shannon entropy, in bits, of the class labels (last column of each row).
    numentries = len(dataset)
    labelcounts = {}
    for featvec in dataset:
        currentlabel = featvec[-1]
        if currentlabel not in labelcounts:
            labelcounts[currentlabel] = 0
        labelcounts[currentlabel] += 1
    shannonent = 0.0
    for key in labelcounts:
        prob = float(labelcounts[key]) / numentries
        shannonent -= prob * log(prob, 2)
    return shannonent
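
def splitdataset(dataset, axis, value):
    # Keep the rows whose feature at index `axis` equals `value`,
    # and drop that feature column from each kept row.
    retdataset = []
    for featvec in dataset:
        if featvec[axis] == value:
            reducedfeatvec = featvec[:axis]
            reducedfeatvec.extend(featvec[axis+1:])
            retdataset.append(reducedfeatvec)
    return retdataset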

def choosebestfeaturetosplit(dataset):
    numfeatures = len(dataset[0]) - 1
    baseentropy = calshannonent(dataset)
    bestinfogain = 0.0
    bestfeature = -1
    for i in range(numfeatures):
        # Collect the values of the i-th feature column
        featlist = [example[i] for example in dataset]
        uniquevals = set(featlist)
        newentropy = 0.0
        for value in uniquevals:
            # Split the dataset on feature i taking this value
            subdataset = splitdataset(dataset, i, value)
            prob = len(subdataset) / float(len(dataset))
            # As in the "watermelon book", weight each subset's entropy
            # by its share of the samples
            newentropy += prob * calshannonent(subdataset)
        infogain = baseentropy - newentropy
        if infogain > bestinfogain:
            bestinfogain = infogain
            # Remember which feature gives the best split so far
            bestfeature = i
    # Returns -1 if no feature yields a positive information gain
    return bestfeature

# Majority vote: return the class that occurs most often
def majoritycnt(classlist):
    classcount = {}
    for vote in classlist:
        if vote not in classcount:
            classcount[vote] = 0
        classcount[vote] += 1
    sortedclasscount = sorted(classcount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]
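
def createtree(dataset, labels):
    classlist = [example[-1] for example in dataset]
    # If every sample has the same class there is nothing left to compute,
    # e.g. ['no', 'no', 'no']; a sibling branch may still have child nodes
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]
    # If no features remain (only the class column is left), return the
    # majority class; this fires at the deepest level, e.g. ['yes', 'no', 'no']
    if len(dataset[0]) == 1:
        return majoritycnt(classlist)
    # Choose the feature to split on at this node
    bestfeat = choosebestfeaturetosplit(dataset)
    if bestfeat == -1:
        # No feature gives a positive gain, so fall back to a majority vote
        return majoritycnt(classlist)
    bestfeatlabel = labels[bestfeat]
    mytree = {bestfeatlabel: {}}
    # Copy the labels without the chosen one, so the caller's list and
    # the sibling branches are not affected by the recursion
    sublabels = labels[:bestfeat] + labels[bestfeat+1:]
    featvalues = [example[bestfeat] for example in dataset]
    uniquevals = set(featvalues)
    # Recurse into the next level for each value of the chosen feature
    for value in uniquevals:
        mytree[bestfeatlabel][value] = createtree(splitdataset(dataset, bestfeat, value), sublabels)
    return mytree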
dataset, labels = createdataset()
# print(calshannonent(dataset))
# print(choosebestfeaturetosplit(dataset))
# print(majoritycnt(['yes', 'no']))
print(createtree(dataset, labels))
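
Once a tree has been built, it can be used to classify new samples. The sketch below is not part of the original listing: `classify` is a hypothetical helper, and `example_tree` is a hand-written tree in the same nested-dict shape that createtree returns, with feature labels assumed to be 'gill' and 'flippers'.

```python
def classify(tree, featlabels, sample):
    """Walk a nested-dict decision tree and return the predicted class.

    Hypothetical helper for illustration; not part of the original listing.
    """
    # A leaf is any non-dict value: it is the class label itself.
    if not isinstance(tree, dict):
        return tree
    # Each internal node is a one-key dict: {feature_label: {value: subtree}}.
    featlabel = next(iter(tree))
    branches = tree[featlabel]
    featvalue = sample[featlabels.index(featlabel)]
    if featvalue not in branches:
        return None  # this value was never seen at this node during training
    return classify(branches[featvalue], featlabels, sample)


# Tree written by hand for illustration (same nesting as createtree's output).
example_tree = {'gill': {0: 'no', 1: {'flippers': {0: 'yes', 1: 'yes'}}}}
featlabels = ['gill', 'flippers']

print(classify(example_tree, featlabels, [1, 0]))  # yes
print(classify(example_tree, featlabels, [0, 1]))  # no
```

Returning None for an unseen feature value is just one possible choice; falling back to the majority class of the training subset at that node would be another.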

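A Python Implementation of the ID3 Algorithm

This article builds on "Principles and Implementation of the ID3 Algorithm (Python)" with additions and modifications; thanks to the original author. 1. New features: (1) the code is split into several files so that function calls are clearer; (2) a GUI was added, along with data loading; (3) one more recursion termination condition was added. 2. GUI demo: taking the dataset provided in the files as an example, fill in the fields as follows. Note: here the position of the class label...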

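ID3 Algorithm: Java Implementation

1.1 Information entropy. Entropy is a measure of disorder or uncertainty. If event A is partitioned into a1, a2, ..., an, which occur with probabilities p1, p2, ..., pn, then the information entropy is defined as H(A) = -(p1 log p1 + p2 log p2 + ... + pn log pn). The logarithm is usually taken to base 2, so the unit of entropy is the bit. 1.2 Decision trees. A decision tree is an instance-based inductive learning algorithm. From a set of unordered, irregular tuples it infers a decision tree representation...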
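
As a worked example of the entropy definition, take the five-sample fish dataset from the Python listing above, which has 2 'yes' and 3 'no' labels. With the standard Shannon entropy and a base-2 logarithm:

```latex
H = -\sum_{i=1}^{n} p_i \log_2 p_i
  = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right)
  \approx 0.529 + 0.442 = 0.971 \text{ bits}
```

This is the value that calshannonent should return for the full dataset.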

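Implementing the ID3 Algorithm in Java

ID3 is a classic classification algorithm. To understand it, you first need a few basic information-theory concepts: information content, entropy, posterior entropy, and conditional entropy. The core idea of ID3 is to split on the attribute with the largest mutual information (information gain), a greedy choice that tends to keep the resulting decision tree shallow. Tree structure: C4.5 decision tree data structure. author zhenhua.chen descripti...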
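
To make the "largest information gain" rule concrete, here is a minimal, self-contained sketch that computes the gain of each feature on the five-sample fish dataset from the Python listing above. The function names `entropy` and `information_gain` are assumptions chosen for this illustration, and the code is a compact restatement of the idea, not the original author's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, axis):
    """Entropy reduction obtained by splitting rows on the feature at `axis`."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row[-1] for row in rows if row[axis] == value]
        # Weight each subset's entropy by its share of the samples.
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Rows are [gill, flippers, class label], as in the listing above.
rows = [[1, 1, 'yes'], [1, 0, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [information_gain(rows, axis) for axis in range(2)]
best = max(range(2), key=lambda axis: gains[axis])
print([round(g, 3) for g in gains], best)  # [0.42, 0.02] 0
```

The first feature (gills) removes far more uncertainty than the second, so ID3 places it at the root.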
