The Pipeline Mechanism in sklearn

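The pipeline mechanism is useful in machine learning algorithms because a set of parameters learned on one dataset must be reused on new data (such as the test set).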
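As a minimal sketch of what "reusing parameters on new data" means, the snippet below fits a StandardScaler on a training array and then applies the learned mean and standard deviation to a test array. The toy numbers are made up for illustration and do not come from the article's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(train)                       # learn mean and std from the training set only
train_scaled = scaler.transform(train)  # apply them to the training set
test_scaled = scaler.transform(test)    # reuse the same mean and std on the test set

print(scaler.mean_)   # the mean learned from the training data
print(test_scaled)    # test value expressed in training-set units
```

The scaler is never refit on the test data; a pipeline automates exactly this bookkeeping across several steps.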

The pipeline mechanism wraps and manages all of the steps as one streamlined workflow (streaming workflows with pipelines).

Note: the pipeline mechanism is more an innovation in programming technique than an innovation in algorithms.

Next, we use a concrete example to demonstrate how powerful pipelines in the sklearn library are:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Breast Cancer Wisconsin dataset; the file has no header row
df = pd.read_csv(''
                 'breast-cancer-wisconsin/wdbc.data', header=None)
# Column 0 is the sample ID, column 1 the label, columns 2 onward the features
x, y = df.values[:, 2:], df.values[:, 1]
# y holds string labels;
# LabelEncoder converts them to integers starting at 0
encoder = LabelEncoder()
y = encoder.fit_transform(y)
# Hold out 20% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)
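To make the label conversion concrete, here is a small sketch of what LabelEncoder does to string labels. The four sample labels are made up; they only mimic the 'M' (malignant) and 'B' (benign) values used in the dataset's diagnosis column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up sample of diagnosis labels, for illustration only
labels = np.array(['M', 'B', 'B', 'M'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; the position of a class is its integer code
print(encoder.classes_)
print(encoded)
# inverse_transform maps the integer codes back to the original strings
print(encoder.inverse_transform(encoded))
```

Because the classes are sorted, 'B' maps to 0 and 'M' maps to 1.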

Steps that can go into a pipeline include feature scaling, dimensionality reduction and a final classifier:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Each step is a (name, object) pair; the steps run in the order listed
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(x_train, y_train)
print('test accuracy: %.3f' % pipe_lr.score(x_test, y_test))
# test accuracy: 0.947

A Pipeline object takes a list of 2-tuples. The first element of each tuple is an arbitrary identifier string, which we use to access individual elements of the Pipeline object; the second element is a scikit-learn compatible transformer or estimator.
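The identifier strings can be used roughly as follows. This sketch rebuilds the same three-step pipeline and relies on the named_steps attribute and the `<name>__<parameter>` convention of set_params and get_params; the change of n_components to 3 is an arbitrary value chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])

# Look up a single step by its identifier string
pca_step = pipe.named_steps['pca']
print(pca_step.n_components)

# Change a parameter of one step with the '<name>__<parameter>' syntax
pipe.set_params(pca__n_components=3)
print(pipe.named_steps['pca'].n_components)

# The same naming scheme is used when reading parameters back
print(pipe.get_params()['clf__random_state'])
```

The same `<name>__<parameter>` keys are what a grid search over a pipeline expects in its parameter grid.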

Pipeline([('sc', StandardScaler()), ('pca', PCA(n_components=2)), ('clf', LogisticRegression(random_state=1))])

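The intermediate steps of a pipeline are scikit-learn compatible transformers, and the last step is an estimator. In the code above, for example, the StandardScaler and PCA transformers form the intermediate steps, and LogisticRegression serves as the final estimator.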
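One way to see this split is to inspect the steps list directly. The sketch below checks which steps expose a transform method and which expose predict; it is only an illustration of the structure, not how scikit-learn validates a pipeline internally.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])

# pipe.steps is the list of (name, object) tuples in execution order
*intermediate, (final_name, final_est) = pipe.steps

# Intermediate steps are transformers: they offer both fit and transform
for name, step in intermediate:
    print(name, hasattr(step, 'fit'), hasattr(step, 'transform'))

# The final step is the estimator that produces predictions
print(final_name, hasattr(final_est, 'predict'))
```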

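When we run pipe_lr.fit(x_train, y_train), StandardScaler first runs its fit and transform methods on the training set, and the transformed data is passed to the next step of the Pipeline object, namely PCA(). Like StandardScaler, PCA also runs fit and transform, and finally passes the transformed data to LogisticRegression. The whole flow is shown in the figure below:

[Figure: the pipeline fit workflow]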
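The flow described above can be sketched by hand and compared with the pipeline. As an assumption for the sake of a self-contained example, this snippet uses load_breast_cancer, the copy of the Wisconsin diagnostic data bundled with scikit-learn, rather than the CSV file read earlier; the variable names are also chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset used as a stand-in for the CSV file
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The pipeline version
pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])
pipe.fit(X_train, y_train)

# The same steps written out by hand
sc = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=1)
X_train_sc = sc.fit_transform(X_train)        # step 1: fit and transform
X_train_pca = pca.fit_transform(X_train_sc)   # step 2: fit and transform
clf.fit(X_train_pca, y_train)                 # step 3: fit the final estimator

# At prediction time the transformers only transform; nothing is refit
manual_pred = clf.predict(pca.transform(sc.transform(X_test)))
pipe_pred = pipe.predict(X_test)
print(np.array_equal(manual_pred, pipe_pred))
```

The manual version needs three objects and careful ordering of fit_transform and transform calls, which is the bookkeeping the pipeline hides behind a single fit and predict.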

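This resembles the multi-layer structure of a neural network, except that the concept of a step is replaced by the concept of a layer; even the last step and the output layer play the same role.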

This is only raising a question: is there not at least a little bit of similarity?
