sklearn Tools Study Notes 4

sklearn dataset transformation:

1. Pipeline and FeatureUnion: combining transformers in series or in parallel

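1) Pipeline: chains multiple estimators into a single estimator. A Pipeline object is built from a list of (key, value) pairs, where the key is a string naming the step and the value is an estimator object. make_pipeline is a shorthand for constructing a pipeline: it takes a variable number of estimators and returns a Pipeline, filling in the name of each estimator automatically.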

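from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

estimators = [...]  # list of (name, estimator) tuples
pipe = Pipeline(estimators)

pipe.steps                # the list of (name, estimator) steps
pipe.named_steps['name']  # look up a single step by its name

pipe.set_params()  # parameters are addressed as <step name>__<parameter>

params = dict()  # maps <step name>__<parameter> to the values to try
grid_search = GridSearchCV(pipe, param_grid=params)  # joint hyperparameter optimization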
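To make the pipeline workflow above concrete, here is a minimal runnable sketch. The iris dataset, the step names `reduce_dim` and `clf`, the choice of PCA and SVC, and the grid values are all assumptions picked for illustration; they do not come from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC

# Illustrative data set (an assumption, not from the notes).
X, y = load_iris(return_X_y=True)

# Explicit construction: each step is a (name, estimator) pair.
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# make_pipeline derives the step names from the lowercased class names.
auto_pipe = make_pipeline(PCA(), SVC())
auto_names = [name for name, _ in auto_pipe.steps]  # ['pca', 'svc']

# Steps can be looked up by name, and their parameters are addressed
# with the <step name>__<parameter> syntax.
pipe.set_params(clf__C=10)
clf_step = pipe.named_steps["clf"]

# Joint search over parameters that belong to different steps.
param_grid = {"reduce_dim__n_components": [2, 3], "clf__C": [0.1, 1, 10]}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)
best_params = grid_search.best_params_
```

Because the whole pipeline is cloned and refit inside each cross-validation fold, the dimensionality reduction is tuned together with the classifier rather than fixed in advance.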

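2) FeatureUnion: combines feature spaces. It joins several transformer objects into a new transformer that combines their outputs. In the data transformation phase, all fitted transformers can be applied in parallel. FeatureUnion and Pipeline can be used together to build more complex models. A FeatureUnion instance is constructed from a list of (key, value) pairs, where the key is a name string and the value is an estimator object. As with Pipeline, FeatureUnion has a shorthand constructor, make_union, which does not require naming each estimator.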

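from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, KernelPCA

estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
combined.set_params()  # parameters are addressed as <transformer name>__<parameter>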
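The following sketch shows how the outputs of the two transformers are concatenated column-wise. The random input data and the `n_components` values are assumptions chosen only so that the output shapes are easy to check.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion, make_union

# Illustrative random data: 20 samples, 5 features (an assumption).
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])

# 2 columns from PCA followed by 3 columns from KernelPCA.
X_combined = combined.fit_transform(X)

# Nested parameters use the <transformer name>__<parameter> syntax.
combined.set_params(linear_pca__n_components=1)
X_smaller = combined.fit_transform(X)

# make_union names the transformers after their lowercased class names.
auto_union = make_union(PCA(n_components=2), KernelPCA(n_components=3))
union_names = [name for name, _ in auto_union.transformer_list]
```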

2. Feature extraction

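from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
vec.fit_transform(measurements).toarray()  # measurements: a list of feature dicts

DictVectorizer applies one-hot encoding to string-valued (categorical) features; numeric values are passed through unchanged.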
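As a small illustration of the one-hot behaviour, the sketch below vectorizes three feature dicts. The `measurements` values are made-up sample data, not taken from the notes.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up sample data: one categorical and one numeric feature.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements).toarray()

# One column per distinct city value, plus the numeric column as-is.
feature_names = vec.feature_names_
# ['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
```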

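Bag-of-words model:

tokenizing: splitting strings into tokens

counting: counting how often each token occurs

normalizing: normalizing and weighting the counts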

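from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)  # corpus: a list of text documents
analyze = vectorizer.build_analyzer()
vectorizer.vocabulary_.get('word')  # column index of the word in the feature space
CountVectorizer(ngram_range=(1, 2))  # extract both 1-grams and 2-grams of words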
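The tokenizing and counting steps can be sketched end to end as follows. The four-sentence `corpus` is an assumed toy example; the default CountVectorizer settings (lowercasing, tokens of at least two characters) are used throughout.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption, chosen to keep the vocabulary small).
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

# The analyzer performs preprocessing and tokenization in one call.
analyze = vectorizer.build_analyzer()
tokens = analyze("This is a text document to analyze.")
# ['this', 'is', 'text', 'document', 'to', 'analyze']

# Column index of a word; the vocabulary is sorted alphabetically.
col = vectorizer.vocabulary_.get("document")  # 1

# With ngram_range=(1, 2), both single words and word pairs are extracted.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_tokens = bigram_vectorizer.build_analyzer()("Bi-grams are cool!")
```

Note that the single-character token "a" is dropped by the default token pattern, which is why it does not appear in `tokens`.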

3. Data preprocessing

1) Standardization

from sklearn import preprocessing

x_scaled = preprocessing.scale(x)
x_scaled.mean(axis=0)  # approximately 0 for every feature
x_scaled.std(axis=0)   # approximately 1 for every non-constant feature

min_max_scaler = preprocessing.MinMaxScaler()
x_train_minmax = min_max_scaler.fit_transform(x_train)

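RobustScaler: estimates the center and scale of the data with statistics that are robust to outliers (the median and the interquartile range), but its centering step cannot be applied to sparse matrices.

If you do not need a scaler object, you can call the corresponding functions directly:

minmax_scale, maxabs_scale, scale, robust_scale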
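The sketch below runs the scalers discussed in this subsection on a small matrix and shows why the object form is useful: statistics learned on training data can be reapplied to new data. The `X_train` and `X_test` values are assumptions for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data (an assumption): 3 samples, 3 features.
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])
X_test = np.array([[-1.0, 1.0, 0.0]])

# Function form: zero mean and unit variance per column.
X_scaled = preprocessing.scale(X_train)

# Object form: learn the statistics once, then reuse them on new data.
scaler = preprocessing.StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# Rescale every column to the [0, 1] range.
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)

# Center on the median and scale by the interquartile range.
X_robust = preprocessing.RobustScaler().fit_transform(X_train)
```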

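2) Normalization

Normalization operates on each row, that is, on the features of an individual sample, and is typically applied before computing distances between samples. The idea is to compute the p-norm of each sample and then divide every element of the sample by that norm, so that each processed sample has a p-norm (l1 or l2) equal to 1.

preprocessing.normalize(x, norm='l2')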
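A quick check of the unit-norm property described above, using an assumed 3x3 input matrix:

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data (an assumption); each row is one sample.
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

X_l2 = preprocessing.normalize(X, norm="l2")
X_l1 = preprocessing.normalize(X, norm="l1")

# After normalization every row has norm 1 under the chosen p-norm.
row_l2_norms = np.linalg.norm(X_l2, axis=1)
row_l1_norms = np.abs(X_l1).sum(axis=1)
```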

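3) Binarization

Feature binarization: thresholds numerical feature values to obtain boolean feature values. It mainly serves as a preprocessing step for probabilistic estimators that model binary-valued features.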
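The notes give no code for this step, so here is a minimal sketch using `preprocessing.Binarizer`. The input matrix and the two threshold values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Illustrative data (an assumption).
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

# Values strictly greater than the threshold map to 1, all others to 0.
X_bin = Binarizer(threshold=0.0).fit_transform(X)

# A higher threshold keeps fewer ones.
X_bin_high = Binarizer(threshold=1.1).fit_transform(X)
```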

4) Encoding categorical features

enc = preprocessing.OneHotEncoder()
enc.fit(x)                  # x: 2-D array of categorical features
enc.transform(x).toarray()  # one-hot encoded result as a dense array
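A worked instance of the encoder calls above. The integer-coded training matrix is an assumed example with 2, 3 and 4 distinct values in its three columns, so the encoded output has 2 + 3 + 4 = 9 columns.

```python
from sklearn.preprocessing import OneHotEncoder

# Illustrative integer-coded categorical data (an assumption).
X_train = [[0, 0, 3],
           [1, 1, 0],
           [0, 2, 1],
           [1, 0, 2]]

enc = OneHotEncoder()
enc.fit(X_train)

# transform returns a sparse matrix by default; toarray() makes it dense.
encoded = enc.transform([[0, 1, 3]]).toarray()
# columns: [2 for feature 0 | 3 for feature 1 | 4 for feature 2]
```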

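5) Imputation of missing values

The Imputer class provides basic strategies for filling in missing values, using the mean, the median, or the most frequent value of the row or column in which the missing value occurs.

from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; later versions provide sklearn.impute.SimpleImputer
import numpy as np

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)  # axis=0 imputes along columns
imp.fit(x)  # x must be 2-D, with rows such as [np.nan, 3]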
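For scikit-learn versions where `Imputer` is no longer available, the same mean-imputation idea can be sketched with `sklearn.impute.SimpleImputer`, which always works column-wise and has no `axis` parameter. The training and test values here are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative training data (an assumption); np.nan marks a missing value.
X_train = [[1, 2], [np.nan, 3], [7, 6]]

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X_train)  # learns column means: (1 + 7) / 2 = 4 and (2 + 3 + 6) / 3

# Missing entries in new data are replaced by the learned column means.
X_new = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X_new)
```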

6) Generating polynomial features

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(2)
poly.fit_transform(x)

To keep only the interaction (cross-product) terms, set the parameter interaction_only=True.
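For two input features (a, b), degree 2 yields the columns (1, a, b, a², ab, b²), and `interaction_only=True` reduces this to (1, a, b, ab). The sketch below shows both outputs on the same 3x2 input used in the notes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # rows: [0, 1], [2, 3], [4, 5]

# Full degree-2 expansion: 1, a, b, a^2, a*b, b^2
X_poly = PolynomialFeatures(2).fit_transform(X)

# Interaction terms only: 1, a, b, a*b
X_inter = PolynomialFeatures(2, interaction_only=True).fit_transform(X)
```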

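7) Custom transformers

Use the FunctionTransformer class to turn an arbitrary function into a transformer.

from sklearn.preprocessing import FunctionTransformer
import numpy as np

transformer = FunctionTransformer(np.log1p)
x = np.array(...)  # input data array
transformer.transform(x)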
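A runnable version of the log-transform idea, with an assumed 2x2 input array. Because the wrapped function is stateless, fitting learns nothing and the result is simply `np.log1p` applied element-wise.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative non-negative input (an assumption).
X = np.array([[0.0, 1.0], [2.0, 3.0]])

# Wrap np.log1p, which computes log(1 + x), as a transformer object.
transformer = FunctionTransformer(np.log1p)
X_log = transformer.fit_transform(X)
```

Wrapping the function this way lets it be used as a step inside a Pipeline alongside other transformers.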
