sklearn Missing-Value Handler: Imputer

class sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Parameter axis: the default is axis=0.

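To be honest, I still haven't quite figured out what axis really means; it always feels like it means something different from one function to the next. It is best to check the official documentation before using it. For Imputer, the documentation says that axis=0 imputes along columns and axis=1 imputes along rows. Most of the time we are working with 2-D arrays anyway, so the parameter descriptions in the documentation are easy to follow.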
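To make the axis question a little more concrete, here is a minimal sketch using plain NumPy rather than Imputer itself. The array `X` is made up for illustration; the point is only to show that axis=0 produces one statistic per column, which is the value a column-wise mean imputation fills in.

```python
import numpy as np

# A small 2-D array: rows are samples, columns are features.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# axis=0 collapses the rows, leaving one mean per column (NaN ignored).
col_means = np.nanmean(X, axis=0)

# axis=1 collapses the columns, leaving one mean per row (NaN ignored).
row_means = np.nanmean(X, axis=1)

# Column-wise mean imputation: each NaN is replaced by its column's mean.
filled = np.where(np.isnan(X), col_means, X)

print(col_means)
print(row_means)
print(filled)
```

Here the NaN in the first column becomes 2.0 (the mean of 1.0 and 3.0) and the NaN in the second column becomes 15.0 (the mean of 10.0 and 20.0).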

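Note: Imputer needs 2-D input, so pass it a DataFrame (for example df[['price']]) rather than a 1-D Series.

Every column in that DataFrame must be numeric.

So the data needs some preparation before it is passed in.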
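Both requirements can be checked with pandas before anything goes into the imputer. The sketch below uses a small made-up DataFrame (the column names are just for illustration) to show the difference between single and double brackets, and one way to list the numeric columns with `select_dtypes`.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: one text column and two numeric columns.
df = pd.DataFrame({
    "size": ["xxl", "l", "m"],
    "price": [8.0, np.nan, 11.0],
    "boh": [22.0, 20.0, np.nan],
})

series = df["price"]     # single brackets: a 1-D Series
frame = df[["price"]]    # double brackets: a 2-D DataFrame with one column
print(series.shape)
print(frame.shape)

# List the numeric columns, i.e. the ones that are safe to impute with a mean.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The Series has shape (3,) while the one-column DataFrame has shape (3, 1), which is the 2-D form the imputer wants.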

When only a few columns are numeric, you can pull those numeric columns out and process them separately.

import pandas as pd
import numpy as np

df = pd.DataFrame([["xxl", 8, "black", "class 1", 22],
                   ["l", np.nan, "gray", "class 2", 20],
                   ["xl", 10, "blue", "class 2", 19],
                   ["m", np.nan, "orange", "class 1", 17],
                   ["m", 11, "green", "class 3", np.nan],
                   ["m", 7, "red", "class 1", 22]])
df.columns = ["size", "price", "color", "class", "boh"]
print(df)
#out:
'''
  size  price   color    class   boh
0  xxl    8.0   black  class 1  22.0
1    l    NaN    gray  class 2  20.0
2   xl   10.0    blue  class 2  19.0
3    m    NaN  orange  class 1  17.0
4    m   11.0   green  class 3   NaN
5    m    7.0     red  class 1  22.0
'''
from sklearn.preprocessing import Imputer
#1. Create the Imputer
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
# Handle only the price column first. Note the use of df[['price']], which returns a DataFrame!
#2. Calling fit_transform() is all it takes to fill in the missing values
df["price"] = imp.fit_transform(df[["price"]])
df
#out:
'''
  size  price   color    class   boh
0  xxl    8.0   black  class 1  22.0
1    l    9.0    gray  class 2  20.0
2   xl   10.0    blue  class 2  19.0
3    m    9.0  orange  class 1  17.0
4    m   11.0   green  class 3   NaN
5    m    7.0     red  class 1  22.0
'''
# Handle the price and boh columns in one go
df[['price', 'boh']] = imp.fit_transform(df[['price', 'boh']])
df
#out:
'''
  size  price   color    class   boh
0  xxl    8.0   black  class 1  22.0
1    l    9.0    gray  class 2  20.0
2   xl   10.0    blue  class 2  19.0
3    m    9.0  orange  class 1  17.0
4    m   11.0   green  class 3  20.0
5    m    7.0     red  class 1  22.0
'''
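One thing to be aware of: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the import above fails on current versions. Its replacement is `SimpleImputer` in `sklearn.impute`, which has no `axis` parameter (it always works column by column) and takes `np.nan` rather than the string "NaN". The following is a sketch of the same price/boh example rewritten under that assumption, not a drop-in guarantee for every old option.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as in the example above.
df = pd.DataFrame(
    [["xxl", 8, "black", "class 1", 22],
     ["l", np.nan, "gray", "class 2", 20],
     ["xl", 10, "blue", "class 2", 19],
     ["m", np.nan, "orange", "class 1", 17],
     ["m", 11, "green", "class 3", np.nan],
     ["m", 7, "red", "class 1", 22]],
    columns=["size", "price", "color", "class", "boh"],
)

# SimpleImputer always imputes per column, so there is no axis argument.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Double brackets keep the selection 2-D, as the imputer requires.
df[["price", "boh"]] = imp.fit_transform(df[["price", "boh"]])
print(df)
```

The missing prices are filled with 9.0 and the missing boh value with 20.0, matching the output shown above.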

When most columns are numeric and only a few are text or categorical attributes, you can drop the text attributes first and merge them back in after processing.

# Assumes housing is an already loaded DataFrame and pandas is imported as pd
from sklearn.preprocessing import Imputer
#1. Create the Imputer
imputer = Imputer(strategy="median")
# There is only one text attribute, so drop it first
housing_num = housing.drop("ocean_proximity", axis=1)
#2. Use fit_transform
x = imputer.fit_transform(housing_num)
# The result is a NumPy array, so convert it back to a DataFrame;
# passing index keeps the rows aligned with housing when the text column is added back
housing_tr = pd.DataFrame(x, columns=housing_num.columns, index=housing_num.index)
# Add the text attribute back
housing_tr['ocean_proximity'] = housing["ocean_proximity"]
housing_tr[:2]
#out:
'''
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income
0    -121.89     37.29                38.0       1568.0           351.0       710.0       339.0         2.7042
1    -121.93     37.05                14.0        679.0           108.0       306.0       113.0         6.4214
'''
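Since `housing` is not defined in this post, here is a self-contained sketch of the same drop, impute, merge-back pattern. The small DataFrame and its index values are made up as a stand-in for the real data, and it uses `SimpleImputer` (the replacement for Imputer in newer scikit-learn) together with `select_dtypes` instead of dropping the text column by name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the housing data, with a non-default index
# such as you might get after shuffling or splitting the rows.
housing = pd.DataFrame(
    {
        "total_rooms": [1568.0, 679.0, 1927.0],
        "total_bedrooms": [351.0, np.nan, 462.0],
        "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN"],
    },
    index=[17606, 18632, 14650],
)

# Keep only the numeric columns for imputation.
housing_num = housing.select_dtypes(include="number")

imputer = SimpleImputer(strategy="median")
x = imputer.fit_transform(housing_num)  # returns a NumPy array

# Rebuild the DataFrame, reusing the original column names and index
# so the text column lines up row by row when it is added back.
housing_tr = pd.DataFrame(x, columns=housing_num.columns, index=housing_num.index)
housing_tr["ocean_proximity"] = housing["ocean_proximity"]
print(housing_tr)
```

Without `index=housing_num.index`, the rebuilt frame would get a fresh 0, 1, 2 index, and the assignment of `ocean_proximity` would align on labels that no longer match, leaving NaN in the text column.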
