Machine Learning Algorithms 02: K-Nearest Neighbors in Practice

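Summary: This article introduces the principle of the kNN algorithm and walks through a worked example. It covers the algorithm description, the algorithm's shortcomings, and an overview of the algorithm.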

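If the majority of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to class A, then that sample is also assigned to class A.

We start from a dataset whose samples already carry labels. When new, unlabeled data comes in, each feature of the new data is compared with the corresponding feature of the samples in the dataset, and the algorithm extracts the class labels of the most similar (nearest) samples. Only the k most similar samples are kept, and the class that occurs most often among those k is chosen.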

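1. Compute the distance between the current point and every point in the labeled dataset.

2. Sort the points by increasing distance.

3. Take the k points closest to the current point.

4. Count how often each class occurs among those k points.

5. Return the most frequent class among the k points as the predicted class of the current point.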
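
The five steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function name knn_classify and the toy points are made up for this example, and Euclidean distance (math.dist, Python 3.8+) is assumed as the distance measure.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from every labeled point to the query point.
    distances = [dist(p, query) for p in train_points]
    # Step 2: indices of the training points, sorted by increasing distance.
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # Step 3: keep the k closest points.
    nearest = order[:k]
    # Step 4: count how often each class occurs among them.
    votes = Counter(train_labels[i] for i in nearest)
    # Step 5: the most frequent class is the prediction.
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, for illustration only): two small clusters.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (5.0, 5.1), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_classify(points, labels, (1.1, 1.0), k=3)
print(prediction)
```

The query point sits inside the first cluster, so its three nearest neighbors all carry the label "a".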

(I am using Jupyter, which is convenient: the packages do not have to be set up by hand, and a package can be used as soon as it is imported.)

1. Load the dataset, taken from the datasets bundled with sklearn.datasets

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data
y = iris.target
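
Before going further it can help to look at what load_iris returns. The sketch below only reads attributes of the returned object (data, target, feature_names, target_names); the dictionary label_names is a helper I introduce for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data, iris.target

print(x.shape)              # (number of samples, number of features)
print(iris.feature_names)   # column order of x
print(iris.target_names)    # class name for each integer label in y

# Hypothetical helper: map the integer labels to class names,
# e.g. for building plot legends later on.
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_names)
```

The mapping matters for the plots: label 0 is setosa, 1 is versicolor and 2 is virginica, and columns 2 and 3 of x hold the petal length and petal width.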

2. Wrap the kNN algorithm in a class

import numpy as np
from math import sqrt
from collections import Counter


class knnclassifier:
    def __init__(self, k):
        """Initialize the kNN classifier."""
        assert k >= 1, "k must be valid"
        self.k = k
        self._x_train = None
        self._y_train = None

    def fit(self, x_train, y_train):
        """Fit the kNN classifier on the training set x_train, y_train."""
        assert x_train.shape[0] == y_train.shape[0], \
            "the size of x_train must equal to the size of y_train"
        assert self.k <= x_train.shape[0], \
            "the size of x_train must be at least k"
        # kNN has no real training phase: fitting just stores the data.
        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_predict):
        """Given the samples x_predict, return the vector of predicted labels."""
        assert self._x_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert x_predict.shape[1] == self._x_train.shape[1], \
            "the feature number of x_predict must equal to x_train"
        y_predict = [self._predict(x) for x in x_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return its predicted label."""
        assert x.shape[0] == self._x_train.shape[1], \
            "the feature number of x must equal to x_train"
        # Euclidean distance: square each difference first, then sum.
        distances = [sqrt(np.sum((i - x) ** 2)) for i in self._x_train]
        nearest = np.argsort(distances)
        topk_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topk_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "knn(k=%d)" % self.k
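
The _predict method relies on the Euclidean distance, sqrt(sum_j (a_j - b_j)^2). The placement of the parentheses matters here, so below is a small check on two made-up vectors (a and b are illustrative values, not taken from the dataset), with np.linalg.norm as an independent reference.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])

# Euclidean distance: square each coordinate difference, sum, take the root.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Misplaced parenthesis: the differences are summed first, so -1 and +1
# cancel out before squaring and the "distance" collapses to zero.
sum_then_square = np.sqrt(np.sum(a - b) ** 2)

# np.linalg.norm of the difference vector is the same Euclidean distance.
via_norm = np.linalg.norm(a - b)

print(euclidean, sum_then_square, via_norm)
```

If the square is applied after the sum, two clearly different points can end up with a distance of zero, which would distort the neighbor ranking.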

3. Split the data, then train and predict. Here k = 20 gives a relatively high test accuracy.

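from sklearn.model_selection import train_test_split

knn = knnclassifier(k=20)
# Hold out 20% of the 150 samples (30 samples) as the test set.
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=0)
knn.fit(train_x, train_y)
result = knn.predict(test_x)
# Number of correctly classified test samples (display is provided by Jupyter).
display(np.sum(result == test_y))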
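
One possible way to compare several values of k is sketched below. Two things here are my assumptions rather than part of the workflow above: it uses scikit-learn's KNeighborsClassifier instead of the hand-written class, and it scores each k with 5-fold cross-validation on the training split so that the test set is not used for choosing k. The candidate values (1, 5, 10, 20) are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

scores = {}
for k in (1, 5, 10, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training split only.
    scores[k] = cross_val_score(model, train_x, train_y, cv=5).mean()

for k, score in scores.items():
    print("k=%d  mean CV accuracy=%.3f" % (k, score))
```

The k with the best cross-validated score can then be evaluated once on test_x and test_y.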

4. Visualize the data above

# Data visualization
import matplotlib.pyplot as plt
# Set the figure size
plt.figure(figsize=(10,10))
# Plot the training set. Columns 2 and 3 (petal length, petal width) are shown here;
# use columns 0 and 1 instead to plot the first two dimensions.
# Labels in load_iris: 0 = setosa, 1 = versicolor, 2 = virginica
plt.scatter(train_x[train_y==0,2],train_x[train_y==0,3],color='y',label='iris-setosa')
plt.scatter(train_x[train_y==1,2],train_x[train_y==1,3],color='orange',label='iris-versicolor')
plt.scatter(train_x[train_y==2,2],train_x[train_y==2,3],color='pink',label='iris-virginica')
# Correctly and wrongly classified test samples (computed but not plotted separately)
right = test_x[result==test_y]
wrong = test_x[result!=test_y]
# Plot the true labels of the test set
plt.scatter(test_x[test_y==0,2],test_x[test_y==0,3],color='b',label='l-iris-setosa')
plt.scatter(test_x[test_y==1,2],test_x[test_y==1,3],color='r',label='l-iris-versicolor')
plt.scatter(test_x[test_y==2,2],test_x[test_y==2,3],color='cyan',label='l-iris-virginica')
# Plot the labels predicted by the kNN model
plt.scatter(test_x[result==0,2],test_x[result==0,3],color='g',label='p-iris-setosa',marker='>')
plt.scatter(test_x[result==1,2],test_x[result==1,3],color='g',label='p-iris-versicolor',marker='+')
plt.scatter(test_x[result==2,2],test_x[result==2,3],color='g',label='p-iris-virginica',marker='*')
plt.title('knn model result')
plt.xlabel('length')
plt.ylabel('width')
plt.legend(loc='best')

5. Result figures (you may need to zoom in to see them clearly)

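1. This is the result plotted on the first two dimensions. Read together with the code above: labels starting with iris are the training set, labels starting with l-iris are the true labels of the test set, and labels starting with p-iris are the kNN model's predictions. In the run shown in the figure, 3 of the 30 test samples are misclassified, so the accuracy is 27/30, i.e. 90%.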

2. This is the result plotted on the last two dimensions (this matches the code above).
