Python Data Mining, Introduction and Practice: The OneR Classification Algorithm

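The OneR (One Rule) algorithm classifies an individual by looking at the existing data and asking which class individuals with the same feature value most often belong to.

In this example, we only need to pick, from the four features of the iris dataset, the one that classifies best and use it as the classification rule.

The features in this dataset are continuous values. Converting continuous values into categorical ones is called discretization.

Each sample has four features: sepal length, sepal width, petal length, and petal width. The dataset can be loaded from scikit-learn: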

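from sklearn.datasets import load_iris
import numpy as np
dataset = load_iris()
# Print the description that ships with the dataset
print(dataset.DESCR)
x = dataset.data      # feature matrix, one row per sample
y = dataset.target    # class label of each sample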

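A simple discretization picks a threshold for each feature: values below the threshold are set to 0, and values at or above it are set to 1. Here the threshold for each feature is the mean of that feature, computed as follows:

# axis=0 gives one mean per feature (column) rather than a single global mean
attribute_means = x.mean(axis=0)

Then convert the result of the comparison to integers:

x_d = np.array(x >= attribute_means, dtype='int')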
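
To illustrate why the mean is taken per feature, here is a minimal sketch on a made-up 3x2 array. The name `toy` and its values are assumptions for illustration only; they are not iris data.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

# One threshold per feature: the mean of each column.
toy_means = toy.mean(axis=0)

# Values at or above their own feature's mean become 1, the rest 0.
toy_d = np.array(toy >= toy_means, dtype='int')
print(toy_means)
print(toy_d)
```

With a single global mean (11.0 for this array), every value of the first feature would fall below the threshold and the feature would carry no information after discretization.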

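With OneR, we compute the error (the number of misclassified samples) obtained when classifying by each feature, and then select the feature with the lowest error as the classification rule.

First, create a function whose parameters are the dataset, the array of class labels, the index of the selected feature, and the feature value: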

from collections import defaultdict
from operator import itemgetter

def train_feature_value(x, y_true, feature_index, value):
    # Count how often each class occurs among samples with this feature value
    class_counts = defaultdict(int)
    for sample, y in zip(x, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # The most frequent class becomes the prediction for this feature value
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Every sample belonging to any other class counts as an error
    incorrect_predictions = [class_count for class_value, class_count in class_counts.items() if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error
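
The same counting idea can be sketched more compactly with `collections.Counter`. The helper name `most_frequent_class_and_error` and the toy lists below are hypothetical, introduced only to show what the function returns for a single feature column; this is an alternative formulation, not the code from the book.

```python
from collections import Counter

def most_frequent_class_and_error(feature_column, labels, value):
    """Hypothetical helper: majority class and error count for one feature value."""
    # Count the classes of the samples whose feature equals `value`.
    counts = Counter(label for v, label in zip(feature_column, labels) if v == value)
    best_class, best_count = counts.most_common(1)[0]
    # Samples that do not belong to the majority class are errors.
    error = sum(counts.values()) - best_count
    return best_class, error

# Made-up discretized feature and class labels for 7 samples.
feature_column = [0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 1, 1, 2, 2, 2]

result_for_0 = most_frequent_class_and_error(feature_column, labels, 0)
result_for_1 = most_frequent_class_and_error(feature_column, labels, 1)
print(result_for_0, result_for_1)
```

For value 0 the classes seen are 0, 0, 1, so class 0 is predicted with one error; for value 1 the classes are 1, 2, 2, 2, so class 2 is predicted, again with one error.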

For any given feature, iterate over each of its values and use the function above to compute the error.

def train_on_feature(x, y_true, feature_index):
    values = set(x[:, feature_index])
    # predictors maps each feature value to its predicted class
    # errors collects the error count for each feature value
    predictors = {}
    errors = []
    # For each feature value, record the most likely class in predictors
    # and store the corresponding error count
    for current_value in values:
        most_frequent_class, error = train_feature_value(x, y_true, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error

Split the dataset into a training set and a test set:

from sklearn.model_selection import train_test_split
xd_train, xd_test, y_train, y_test = train_test_split(x_d, y, random_state=14)

Next, compute the target class for every value of every feature (the predictors):

all_predictors = {}
errors = {}
for feature_index in range(xd_train.shape[1]):
    predictors, total_error = train_on_feature(xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error

Find the feature with the lowest error and use it as the classification rule:

best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
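
As an aside, the same selection can be written with `min` instead of sorting the whole dictionary. The error counts below are made up for illustration and are not results from the iris data.

```python
# Hypothetical error counts per feature index, for illustration only.
errors = {0: 41, 1: 58, 2: 37, 3: 40}

# Pick the key whose value (the error count) is smallest.
best_feature = min(errors, key=errors.get)
best_error = errors[best_feature]
print(best_feature, best_error)
```

Both forms return the first feature with the minimum error when there is a tie, since `sorted` is stable and `min` keeps the first minimal item it encounters.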

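Sorting the features by error gives the best feature, from which we build the model. The predict function then makes predictions by looking up each sample of the dataset in that model: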

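# The model stores the chosen feature and its value-to-class mapping
model = {'feature': best_feature, 'predictor': all_predictors[best_feature]}
def predict(x_test, model):
    feature = model['feature']
    predictor = model['predictor']
    # Look up the predicted class for each sample's value of the chosen feature
    y_predicted = [predictor[int(sample[feature])] for sample in x_test]
    return y_predicted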

Compare the predictions with the actual classes to obtain the accuracy.
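
The post gives no code for this step, so the following is only a sketch of one way to do it, assuming a model dictionary with the 'feature' and 'predictor' keys used by predict. The names `toy_model`, `toy_x_test`, and `toy_y_test` and their values are made up so that the block runs on its own.

```python
import numpy as np

def predict(x_test, model):
    feature = model['feature']
    predictor = model['predictor']
    # Wrapped in np.array here so the result compares element-wise with the labels.
    return np.array([predictor[int(sample[feature])] for sample in x_test])

# Hypothetical model: classify on feature 1; value 0 -> class 0, value 1 -> class 2.
toy_model = {'feature': 1, 'predictor': {0: 0, 1: 2}}

# Hypothetical discretized test samples and their true classes.
toy_x_test = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
toy_y_test = np.array([0, 2, 1, 0])

y_predicted = predict(toy_x_test, toy_model)
# Fraction of matching predictions, expressed as a percentage.
accuracy = np.mean(y_predicted == toy_y_test) * 100
print("Accuracy: {:.1f}%".format(accuracy))
```

On the real data the same two lines would be applied to `xd_test` and `y_test` with the trained `model`.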

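The idea behind OneR is simple, but the whole programming process is fairly complex for a beginner, so more practice with algorithms and data structures is still needed.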
