pandas Study Notes (2): Handling Abnormal Values

2021-10-13 03:04:07

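# First, create a dataset (one that contains missing values):
import numpy as np
from pandas import DataFrame

df = DataFrame(data = np.random.randint(0,150,size = (200,4)),
               columns = ['python','english','math','chinese'])

# Overwrite up to 30 randomly chosen cells with NaN
for i in range(30):
    index = np.random.randint(0,200,size = 1)[0]
    column = np.random.randint(0,4,size = 1)[0]
    df.iloc[index,column] = np.nan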

df
Out[14]:
     python  english   math  chinese
0     104.0     27.0   86.0    113.0
1     138.0      NaN  132.0    126.0
2       3.0    113.0   37.0     64.0
3      47.0    110.0   32.0    100.0
4      87.0     29.0  126.0    144.0
..      ...      ...    ...      ...
195    63.0     98.0  147.0     18.0
196    26.0      9.0   97.0     10.0
197    57.0    141.0    NaN     37.0
198    52.0     78.0   26.0    143.0
199    79.0     46.0  134.0     86.0

# Return the rows that contain at least one missing value

cond = df.isnull().any(axis = 1)

df[cond]

Out[15]:
     python  english   math  chinese
1     138.0      NaN  132.0    126.0
5     136.0     21.0  146.0      NaN
12     83.0    108.0    NaN     49.0
17      6.0      NaN   87.0     88.0
20    110.0      NaN   18.0    145.0
21     36.0      NaN  130.0     66.0
36     23.0      NaN   74.0     46.0
37      NaN     29.0   18.0     34.0
38    137.0      NaN   67.0     34.0
40     15.0     85.0   82.0      NaN
46     71.0    147.0   69.0      NaN
65    105.0    105.0   12.0      NaN
78      NaN     56.0    3.0     23.0
83     89.0    101.0  133.0      NaN
94     55.0      3.0    NaN    149.0
99     92.0      NaN  121.0     23.0
105    43.0     36.0  110.0      NaN
108    75.0     91.0    NaN     40.0
110    80.0     36.0   40.0      NaN
119    36.0      NaN   59.0     23.0
120    35.0     80.0    NaN    121.0
128    34.0     18.0   24.0      NaN
130   116.0     12.0    NaN     28.0
142    93.0      2.0    NaN     75.0
156    95.0      0.0    NaN     36.0
170     NaN      8.0   38.0    122.0
173     3.0      NaN   28.0     59.0
175     NaN     70.0   92.0     39.0
182   135.0     59.0  102.0      NaN
197    57.0    141.0    NaN     37.0

# Return all valid rows (rows with no missing values)

cond = df.notnull().all(axis = 1)

df[cond]

# Method 2:

df.dropna()  # Drop every row that contains missing data
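As a quick check that the two methods above select the same rows, here is a minimal sketch on a small hand-made frame. The name `demo` and its values are made up for illustration and are not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are arbitrary.
demo = pd.DataFrame({
    'python':  [104.0, 138.0, 3.0],
    'english': [27.0, np.nan, 113.0],
    'math':    [86.0, 132.0, np.nan],
})

# Method 1: boolean mask that is True only where every cell in the row is non-null.
mask = demo.notnull().all(axis=1)
by_mask = demo[mask]

# Method 2: dropna() with default arguments drops any row holding at least one NaN.
by_dropna = demo.dropna()

print(by_mask.equals(by_dropna))
print(list(by_mask.index))
```

Only row 0 is complete in this frame, so both methods keep that single row.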

# Drop a row or a column

df.drop(labels = ['english'],axis = 1)  # Drop one column

cond = df.isnull().any(axis = 1)  # Drop the rows that contain missing data

index = df[cond].index

df.drop(labels = index)

# Drop the rows in which any score is below 60

cond = (df<60).any(axis = 1)

index = df[cond].index

df.drop(labels = index)
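One detail worth seeing in isolation: any comparison with NaN evaluates to False, so a row with a missing score is not dropped by this condition unless another score in it is below 60. The sketch below uses a made-up frame `demo` (an assumption for illustration) and also shows that dropping by index label gives the same result as keeping the rows where the condition is False.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 has a score below 60, row 2 has a missing score.
demo = pd.DataFrame({
    'python':  [104.0, 59.0, 90.0],
    'english': [70.0, 80.0, np.nan],
})

# NaN < 60 is False, so NaN cells never count as "below 60".
cond = (demo < 60).any(axis=1)

# Drop by index label, as in the notes above.
dropped = demo.drop(labels=demo[cond].index)

# Equivalent here: keep the rows where the condition is False.
kept = demo[~cond]

print(dropped.equals(kept))
print(list(kept.index))
```

Row 2 survives even though it contains NaN, so combine this step with `dropna()` if incomplete rows should go as well.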

# Find the rows whose mean score is below 60 or above 110

cond1 = df.mean(axis = 1)<60

cond2 = df.mean(axis = 1)>110

cond3 = cond1|cond2  # element-wise OR; & is element-wise AND

df[cond3]

# np.nan can take part in ordinary arithmetic, but the result is always NaN; None cannot take part in arithmetic and raises an error
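The difference between `np.nan` and `None` can be checked directly. The sketch below also shows one caveat that matters for the `df.mean(axis = 1)` calls above: pandas reductions skip NaN by default (`skipna=True`), so NaN only propagates through a reduction when you ask for it. The Series `s` is a made-up example.

```python
import numpy as np
import pandas as pd

# NaN propagates through ordinary arithmetic.
nan_sum = np.nan + 1
print(np.isnan(nan_sum))

# None does not support arithmetic and raises TypeError.
try:
    None + 1
    none_failed = False
except TypeError:
    none_failed = True
print(none_failed)

# pandas reductions ignore NaN unless skipna=False is passed.
s = pd.Series([60.0, np.nan, 80.0])
mean_skipping_nan = s.mean()
mean_with_nan = s.mean(skipna=False)
print(mean_skipping_nan, mean_with_nan)
```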

# Fill missing data

df.fillna(60)

df.fillna(value = df.mean())  # Fill each column with that column's mean
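To make the mean fill concrete: `df.mean()` returns a Series indexed by column name, and `fillna` aligns that Series on the column labels, so each column is filled with its own mean. A minimal sketch on a made-up frame `demo` (values chosen so the means are easy to verify by hand):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'python':  [100.0, np.nan, 50.0],
    'english': [np.nan, 80.0, 40.0],
})

# Column means, with NaN skipped: python -> 75.0, english -> 60.0
col_means = demo.mean()

# Each NaN is replaced by the mean of the column it sits in.
filled = demo.fillna(value=col_means)
print(filled)
```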

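# Whatever filling method you choose, the filled values are fabricated data, so make them as realistic as possible

df.bfill()  # same as df.fillna(method = 'backfill'); the method keyword is deprecated since pandas 2.1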

'''method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series
    pad / ffill: propagate last valid observation forward to next valid
    backfill / bfill: use next valid observation to fill gap.
'''
# Missing values can also be filled by an algorithm or with a local mean
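The notes do not say how a local mean fill would be done. One possible reading, sketched here as an assumption rather than the author's method, is to replace each missing value with the mean of the non-missing values in a small centered window around it. The Series `s` and the window size of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([60.0, np.nan, 80.0, 100.0, np.nan, 90.0])

# Centered window of 3: the mean of each value and its two neighbours.
# NaN entries are ignored; min_periods=1 allows windows with a single valid value.
local_mean = s.rolling(window=3, center=True, min_periods=1).mean()

# Only the missing positions take the local mean; existing values are untouched.
filled = s.fillna(local_mean)
print(filled.tolist())
```

Here the gap between 60 and 80 becomes 70, and the gap between 100 and 90 becomes 95. A window that contains no valid value at all would stay NaN, so longer runs of missing data need a wider window or a different method.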
