Machine Learning Algorithms 2: k-Nearest Neighbors in Practice

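Preface: this post is the hands-on part and does not go into the theory. The program is implemented in two ways: the first is written by hand without relying on a machine-learning library, and the second uses sklearn.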

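The k-NN example used here is the familiar handwritten digit recognition task. The training data, the test data, and the program are all kept in the same directory.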
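The code in this post reads each digit from a text file of 32 lines with 32 digit characters per line, and takes the label from the part of the file name before the underscore. As a rough sketch of that layout, the snippet below parses an in-memory sample instead of a real file. The helper names `text_image_to_vector` and `label_from_filename`, the file name `3_17.txt`, and the sample image are all made up for illustration.

```python
def text_image_to_vector(lines, size=32):
    """Flatten `size` lines of `size` digit characters into one list of ints."""
    vector = []
    for row in range(size):
        line = lines[row]
        # Pixel (row, col) ends up at index size * row + col.
        vector.extend(int(ch) for ch in line[:size])
    return vector


def label_from_filename(filename):
    """Return the integer before the first underscore, e.g. '3_17.txt' -> 3."""
    return int(filename.split('_')[0])


# Hypothetical sample: a 32x32 image that is all zeros except a vertical bar in column 5.
sample_text = "\n".join("0" * 5 + "1" + "0" * 26 for _ in range(32))
sample_lines = sample_text.splitlines()

vector = text_image_to_vector(sample_lines)
print(len(vector))                      # 1024
print(sum(vector))                      # 32
print(label_from_filename("3_17.txt"))  # 3
```

Flattening row by row keeps every image at the same length (1024), which is what lets all images be stacked into one matrix with one row per sample.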

from numpy import tile, zeros
from os import listdir
import operator
import time

# Decorator that measures the run time of the wrapped function
# (the outer function name `timer` is assumed)
def timer(func):
    def warrper():
        starttime = time.time()
        func()
        pretime = time.time()
        runtime = (pretime-starttime)
        print("the running time:",runtime)
    return warrper

# Compute the distances, sort them, take the k smallest,
# and return the class that occurs most often among them
def classify0(inx,dataset,labels,k):
    datasetsize = dataset.shape[0]
    # tile repeats inx datasetsize times along the rows and once along the columns
    diffmat = tile(inx,(datasetsize,1))-dataset
    sqdiffmat = diffmat**2
    sqdistance = sqdiffmat.sum(axis=1)
    distance = sqdistance**0.5
    # argsort returns the indices that would sort the distances in ascending order
    sorteddistances = distance.argsort()
    classcount = {}
    for i in range(k):
        votelable = labels[sorteddistances[i]]
        classcount[votelable] = classcount.get(votelable,0)+1
    sortedclasscount = sorted(classcount.items(),key = operator.itemgetter(1),reverse=True)
    return sortedclasscount[0][0]

# Convert a 32x32 text image into a 1x1024 vector
def img2vetor(filename):
    returnvect = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            linestr = fr.readline()
            for j in range(32):
                returnvect[0,32*i+j] = int(linestr[j])
    return returnvect

# Load the training data and the test data, then classify every test sample
@timer
def handwritingclasstest():
    hwlables = []
    trainingfilelist = listdir('./trainingdigits')
    m = len(trainingfilelist)
    trainingmat = zeros((m,1024))
    for i in range(m):
        # print(trainingfilelist[i])
        filenamestr = trainingfilelist[i]
        filestr = filenamestr.split('.')[0]
        # the label is the number before the underscore in the file name
        classnumstr = int(filestr.split('_')[0])
        hwlables.append(classnumstr)
        trainingmat[i,:] = img2vetor('./trainingdigits/%s'%filenamestr)
    testfilelist = listdir('./testdigits')
    errorcount = 0
    mtest = len(testfilelist)
    for j in range(mtest):
        testfilename = testfilelist[j]
        testclassnum = int(testfilename.split('_')[0])
        vectorundertest = img2vetor('./testdigits/%s'%testfilename)
        classresult = classify0(vectorundertest,trainingmat,hwlables,3)
        print('the classifier come back with:%d,the real answer is %d' %(classresult,testclassnum))
        if classresult!=testclassnum:
            errorcount +=1
    print("\nthe totle error is %d" %errorcount)
    print("\nthe totle error rate is %f"%(errorcount/float(mtest)))

handwritingclasstest()

Program output:

the totle error is 10

the totle error rate is 0.010571

the running time: 38.14647126197815

process finished with exit code 0
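The distance-and-vote step inside `classify0` is easier to check by hand on a few 2-D points. The following is a minimal sketch of the same idea in plain Python, not the post's NumPy implementation. The function name `knn_classify` and the toy points and labels are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples closest to `query`."""
    # Euclidean distance from the query to every training sample.
    distances = [math.dist(query, sample) for sample in samples]
    # Indices of the k nearest samples, ordered from nearest to farthest.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Count the labels of those neighbors and pick the most frequent one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'.
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

print(knn_classify((0.1, 0.2), samples, labels, k=3))  # B
print(knn_classify((0.9, 0.9), samples, labels, k=3))  # A
```

With k=3, the query (0.1, 0.2) has two 'B' neighbors and one 'A' neighbor, so the vote goes to 'B'. The digit program does the same thing with 1024-dimensional vectors.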

# Second version: the same task using sklearn's KNeighborsClassifier
from numpy import zeros
from os import listdir
from sklearn.neighbors import KNeighborsClassifier
import time

# Decorator that measures the run time of the wrapped function
# (the outer function name `timer` is assumed)
def timer(func):
    def warrper():
        starttime = time.time()
        func()
        pretime = time.time()
        runtime = (pretime-starttime)
        print(runtime)
    return warrper

# This step is required: each image has to be converted into a one-dimensional vector
def img2vector(filename):
    returnvector = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            linestr = fr.readline()
            for j in range(32):
                returnvector[0,32*i+j]=int(linestr[j])
    return returnvector

# Read the training images and convert them into vectors
def training2vetor():
    trainingfilelist = listdir('./trainingdigits')
    m = len(trainingfilelist)
    trainmat = zeros((m,1024))
    hwlabels = []
    for i in range(m):
        trainmat[i, :] = img2vector('./trainingdigits/%s' % trainingfilelist[i])
        # the label is the number before the underscore in the file name
        trainingnum = int(trainingfilelist[i].split('_')[0])
        hwlabels.append(trainingnum)
    return trainmat,hwlabels

# Classify the test data
@timer
def testclass():
    clf = KNeighborsClassifier(n_neighbors=3,algorithm='kd_tree',n_jobs=-1)
    trainmat,hwlabels = training2vetor()
    clf.fit(trainmat,hwlabels)
    testclasslist = listdir('./testdigits')
    mtest = len(testclasslist)
    errorcount = 0
    for i in range(mtest):
        testname = testclasslist[i]
        testnum = int(testname.split('_')[0])
        testvector = img2vector('./testdigits/%s'%testclasslist[i])
        # predict returns an array with one label per input row
        testresult=clf.predict(testvector)
        if testresult[0]!=testnum:
            errorcount+=1
    print("\nthe totle error is %d" % errorcount)
    print("\nthe totle error rate is %f" % (errorcount / float(mtest)))

testclass()
# After running both versions, the library version turned out to be clearly slower
# than the hand-written code, so a decorator was added to compare the two run times

Program output:

the totle error is 12

the totle error rate is 0.012685

103.89552760124207

process finished with exit code 0
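Both programs are timed with a decorator. As a sketch of a slightly more general form of that idea, the version below forwards arguments and return values, so it can wrap functions other than zero-argument ones. The names `timed` and `slow_sum` are hypothetical, and the use of `time.perf_counter` and `functools.wraps` is a choice made here, not something taken from the post's code.

```python
import functools
import time


def timed(func):
    """Wrap `func` so each call returns (result, elapsed_seconds)."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper


@timed
def slow_sum(n):
    """Hypothetical workload: add up the integers below n."""
    return sum(range(n))


total, seconds = slow_sum(100_000)
print(total)
print(f"running time: {seconds:.6f} s")
```

Returning the elapsed time instead of printing it makes it easy to collect the timings of the two classifiers and compare them in one place.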
