Data Preprocessing Workflow Walkthrough (1): Single-Feature Processing

For a single numeric variable, the following approaches can be used for an initial analysis.

Variable analysis

import seaborn as sns
# DataFrame has a built-in method for computing quantiles (here the 10th percentile)
df.quantile(0.1)
# Box plot; passing the column as y draws it vertically
sns.boxplot(y=df['v'], width=0.5)
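
The quantiles and the box plot are closely related: under the usual 1.5 × IQR convention, the whiskers of a box plot are derived from the first and third quartiles. The following is a minimal sketch of that calculation. The sample data is made up for illustration, and the column name `v` simply mirrors the snippet above.

```python
import pandas as pd

# Hypothetical sample data; one value (100) is far from the rest
df = pd.DataFrame({"v": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

q1 = df["v"].quantile(0.25)   # first quartile
q3 = df["v"].quantile(0.75)   # third quartile
iqr = q3 - q1                 # interquartile range

# Whisker limits under the common 1.5 * IQR convention
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are the ones a box plot draws as individual markers
outliers = df.loc[(df["v"] < lower) | (df["v"] > upper), "v"]
print(q1, q3, lower, upper)
print(outliers.tolist())
```

Whether such points should be removed, clipped, or kept depends on the data; the limits only flag candidates for inspection.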

Histograms and Q-Q plots are commonly used to check whether the data follow a normal distribution.

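import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
ax = plt.subplot(1,2,1)
# Draw the histogram together with a fitted normal curve
# (note: distplot has been deprecated since seaborn 0.11)
sns.distplot(df[columnname], fit=stats.norm)
ax = plt.subplot(1,2,2)
# Q-Q plot against the normal distribution
res = stats.probplot(df[columnname], plot=plt)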
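
Besides the visual check, `stats.probplot` also returns the least-squares fit of the Q-Q points, and its correlation coefficient gives a rough numeric indication of how straight the Q-Q line is. The sketch below compares a normal sample with a skewed one; both samples are synthetic and generated only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)

# Without a plot argument, probplot only computes the Q-Q points and the fit:
# ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the points lie close to a straight line
print(r_normal, r_skewed)
print(stats.skew(normal_sample), stats.skew(skewed_sample))
```

A value of r noticeably below 1, or a clearly non-zero skewness, suggests that a transformation may be worth trying before modeling.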

Single-feature preprocessing

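from scipy import stats
def box_cox_trans(data):
    '''Box-Cox transform of every column of a DataFrame.
    Returns the transformed data and a dict of the fitted lambda per column.
    '''
    data = data.dropna(axis=0)
    # Box-Cox requires strictly positive values; shift by 1 so zeros are allowed
    data = data + 1
    lamdic = {}
    for icol in data.columns:
        # stats.boxcox returns (transformed values, fitted lambda)
        transformed, lam = stats.boxcox(data[icol])
        data[icol] = transformed
        lamdic[icol] = lam
    return data, lamdic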
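
The fitted lambda is worth keeping because it is needed to map transformed values (for example, model predictions) back to the original scale. Here is a minimal sketch of the round trip on a single synthetic column, using `scipy.special.inv_boxcox`; the lognormal sample is an assumption chosen only because it is strictly positive and right-skewed.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strictly positive, right-skewed

# Forward transform: lambda is estimated by maximum likelihood
transformed, lam = stats.boxcox(x)

# Inverse transform with the same lambda recovers the original values
restored = inv_boxcox(transformed, lam)

print(lam)
print(stats.skew(x), stats.skew(transformed))
print(np.allclose(restored, x))
```

If the data were shifted by 1 before the transform, as in the function above, the same shift has to be subtracted again after inverting.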

from statsmodels.tsa.seasonal import seasonal_decompose
def ts_decompose(timeseries):
    '''Decompose a time series into trend, seasonal and residual parts.
    timeseries:
        a Series whose index is a timestamp
    '''
    #timeseries = ser_2_df(timeseries)
    decomposition = seasonal_decompose(
        timeseries.values,
        period=24,         # default period
        model="additive")  # "additive" or "multiplicative" model
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    return trend, seasonal, residual
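
To make the additive model (series = trend + seasonal + residual) more concrete, the following sketch works through the classical moving-average decomposition by hand with NumPy. The function name `additive_decompose` and the synthetic series are hypothetical, and the code is an illustration of the idea rather than a description of how `seasonal_decompose` is implemented internally.

```python
import numpy as np

def additive_decompose(x, period):
    """Classical additive decomposition of a 1-D array (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Centered moving average; an even period uses half weights at both ends
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # the edges cannot be estimated and stay NaN
    trend[half:n - half] = np.convolve(x, weights, mode="valid")
    # Average the detrended values at each position within the period
    detrended = x - trend
    pattern = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    pattern -= pattern.mean()  # seasonal component averages to zero over a period
    seasonal = np.tile(pattern, n // period + 1)[:n]
    residual = x - trend - seasonal
    return trend, seasonal, residual

# Synthetic hourly-style series: linear trend plus a cycle of length 24
t = np.arange(24 * 10)
true_seasonal = 5 * np.sin(2 * np.pi * t / 24)
series = 0.1 * t + true_seasonal

trend, seasonal, residual = additive_decompose(series, period=24)
print(np.nanmax(np.abs(residual)))
```

On this noise-free series the residual is essentially zero wherever the trend is defined; on real data it carries whatever the trend and the repeating pattern do not explain.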
