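Data preprocessing

Missing-value imputation

One-hot variables

Standardization and binning

Why preprocessing matters: for classification and for ** alike, it can improve accuracy.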

Converting strings to numeric values

String-valued categorical variables usually have dtype object.

x = rawdata.iloc[:, 0:-1]
x.dtypes  # string columns are reported as "object"

for col in x.columns:
    if x[col].dtype == "O":
        x[col] = pd.Categorical(x[col]).codes
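As a small illustration of what the loop does to one column, the sketch below encodes a hypothetical toy Series (the values are made up, not taken from the dataset) and reads the code-to-label correspondence back from the `categories` attribute. pd.Categorical numbers the categories in sorted order.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
marital = pd.Series(["married", "single", "divorced", "single"])

cat = pd.Categorical(marital)
codes = cat.codes                           # one integer code per row
mapping = dict(enumerate(cat.categories))   # code -> original label

print(list(codes))
print(mapping)
```

Keeping `mapping` around makes it possible to translate the codes back to labels later.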

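The advantage is speed; the drawback is that the mapping rule is implicit, so you do not immediately know which code stands for which category. If you want to define the mapping yourself: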

x["marital"] = x["marital"].map(...)  # pass a dict of the form {label: code}

or:

# Assign the result back; inplace=True on a .loc selection may act on a copy.
x = x.replace(dict(primary=1, secondary=2, tertiary=3))
dataset["contact"] = dataset["contact"].replace(dict(cellular=0, telephone=1))
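A minimal sketch of the map variant follows. The column values and the `marital_codes` dictionary are hypothetical, chosen only to show the behavior: map looks every value up in the dictionary, and any value without an entry becomes NaN.

```python
import pandas as pd

# Hypothetical data and mapping, chosen only for illustration.
x = pd.DataFrame({"marital": ["married", "single", "divorced", "unknown"]})
marital_codes = {"single": 0, "married": 1, "divorced": 2}

# map() looks each value up in the dict; values without an entry become NaN.
x["marital"] = x["marital"].map(marital_codes)

print(x["marital"].tolist())
```

replace, by contrast, leaves values that are not in the dictionary unchanged, so map is the stricter choice when every category is supposed to be covered.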

If there are very many categories, the following approach can be used:

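x["job"] = x["job"].map(...)  # pass a dict built from the unique values
print(list(enumerate(set(x["job"]))))

You can inspect the correspondence with set(x["job"]); the drawback is that the numbering cannot be customized.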
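One possible way to build such a mapping automatically is sketched below. It is an assumption about the intended approach, not the original code: `job_codes` is a hypothetical name, the data is made up, and the column is assumed to contain no missing values. Sorting the unique values first keeps the numbering stable, since the iteration order of a set of strings can differ between runs.

```python
import pandas as pd

# Hypothetical toy column with no missing values.
x = pd.DataFrame({"job": ["services", "admin", "technician", "admin"]})

# Sorting the unique values makes the numbering reproducible across runs.
job_codes = {job: code for code, job in enumerate(sorted(set(x["job"])))}
x["job"] = x["job"].map(job_codes)

print(job_codes)
print(x["job"].tolist())
```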

For missing values, inspect the dataset directly and then simply use replace:

dataset["job"].value_counts()
# Assign the result back; inplace=True on a .loc selection may act on a copy.
dataset["job"] = dataset["job"].replace(np.nan, "blue-collar")
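The same fill can be written without hard-coding the value. The sketch below assumes that the replacement should be the most frequent category (the source does not say how "blue-collar" was chosen) and uses a made-up toy column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
dataset = pd.DataFrame(
    {"job": ["blue-collar", "admin", np.nan, "blue-collar"]}
)

# value_counts() skips NaN by default; idxmax() returns the most frequent label.
most_common = dataset["job"].value_counts().idxmax()
dataset["job"] = dataset["job"].fillna(most_common)

print(dataset["job"].tolist())
```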

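KNN imputation must be done after encoding, because it works on numeric values. It fills a missing value with the mean of that attribute over the n most similar samples.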

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
dataset_knn = pd.DataFrame(imputer.fit_transform(dataset), columns=dataset.columns)

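Notes:

fit_transform returns a new NumPy array, so the result has to be assigned to a new variable;

sklearn's built-in KNNImputer appears to support only averaging, not taking the most frequent value, so it is not well suited to categorical variables.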
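To make the averaging concrete, here is a small sketch on made-up numeric data (the array is hypothetical and much smaller than a real dataset). The last row is missing its second feature; its two nearest rows by the first feature have the values 20 and 30, so the gap is filled with their mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; the last row is missing its second feature.
data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.2, np.nan],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(data)   # returns a new array; `data` is unchanged

print(filled[4, 1])
```

The filled value is generally not one of the existing values in the column, which is why integer category codes need extra handling after this step.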

Handling categorical variables (the imputed values are simply rounded):

dataset_knn.loc[:, "education"] = np.round(dataset_knn["education"], decimals=0)

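Note: for one-hot encoding with pd.get_dummies, the column dtype must be "O" (object). String columns are encoded; numeric columns are not.

Convert all object columns into one-hot variables: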

y = pd.get_dummies(y)

This takes longer when an attribute has many distinct values.
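The dtype rule can be seen on a tiny, made-up frame with one string column and one numeric column. By default only the string column is expanded; the `columns` parameter of pd.get_dummies can be used to force a numeric column to be encoded as well.

```python
import pandas as pd

# Hypothetical toy frame: one object column and one numeric column.
df = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular"],
    "age": [30, 45, 52],
})

encoded = pd.get_dummies(df)                  # only "contact" is expanded
forced = pd.get_dummies(df, columns=["age"])  # explicitly encode "age" instead

print(encoded.columns.tolist())
print(forced.columns.tolist())
```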

from sklearn.preprocessing import StandardScaler
age_2d = np.array(dataset["age"]).reshape(-1, 1)
dataset["age"] = StandardScaler().fit_transform(age_2d).ravel()

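reshape(-1, 1) turns the 1-D array into a single column, that is, a 2-D array of shape (n, 1), which is the input shape the scaler expects.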
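A quick check on made-up ages shows both the shape change and the effect of the scaling. The column name `age_scaled` is hypothetical; after the transform the values have zero mean and unit standard deviation (StandardScaler uses the population standard deviation).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data.
dataset = pd.DataFrame({"age": [20, 30, 40, 50, 60]})

ages = np.array(dataset["age"]).reshape(-1, 1)   # shape (5, 1)
scaled = StandardScaler().fit_transform(ages)    # still shape (5, 1)
dataset["age_scaled"] = scaled.ravel()           # back to 1-D for the column

print(ages.shape, scaled.mean(), scaled.std())
```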

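For binning, the number of bins is mostly chosen by experience. pd.cut produces equal-width bins; pd.qcut produces equal-frequency bins.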

age = pd.cut(x["age"], 10, labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
balance = pd.qcut(x["balance"], 5, labels=[1, 2, 3, 4, 5])
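The difference between the two functions is easiest to see on skewed data. In this sketch (the values are made up) a single outlier pushes almost everything into the first equal-width bin, while the equal-frequency bins stay balanced.

```python
import pandas as pd

# Hypothetical skewed data with one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

equal_width = pd.cut(values, 2, labels=[1, 2])   # bins split the value range in half
equal_freq = pd.qcut(values, 2, labels=[1, 2])   # bins split the samples in half

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()

print(width_counts)
print(freq_counts)
```

For a heavily skewed column such as an account balance, this is one reason to prefer qcut.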
