Data Exploration in Data Mining

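This post covers:

1. Exploring categorical features: checking how many categories each categorical feature has

2. Exploring numerical features: how to discretize them

3. Removing features in which most rows share the same value

4. Handling time features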

Required Python packages

from pandas import Series, DataFrame
import pandas as pd

I. Check how many categories each categorical feature has

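def findnumofcatfeacture(data, feacture_cols, flag_dropcat=50):
    '''
    Count the distinct categories in each categorical feature and
    flag the features that have too many categories.
    Input:  data: the whole dataset, including index and target
            feacture_cols: list of categorical feature names
            flag_dropcat: a feature with more categories than this is dropped
    Output: name_len: list  [('feacture1', len), ('feacture2', len)]
            dropcat_cols: list of feature names to drop (too many categories)
    '''
    # Count how many categories one categorical feature has
    def num_cat(x):
        eachcat = list(x.value_counts().index)
        return len(eachcat)

    catdata = data[feacture_cols]
    catdata_cols = list(catdata.columns)
    # Category count of every column, in column order
    lencatdata = list(catdata.apply(num_cat))
    # list() so the pairs can be iterated more than once on Python 3
    name_len = list(zip(catdata_cols, lencatdata))
    dropcat_cols = [name for name, n in name_len if n > flag_dropcat]
    return name_len, dropcat_cols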
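As a cross-check of the counting step, pandas' built-in `DataFrame.nunique` should give the same per-column category counts. The sketch below is not part of the original post: the toy DataFrame, its column names, and the threshold of 2 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'city': ['a', 'b', 'c', 'a'],
    'gender': ['m', 'f', 'm', 'f'],
    'target': [0, 1, 0, 1],
})
cat_cols = ['city', 'gender']
threshold = 2  # illustrative stand-in for flag_dropcat

# nunique() counts distinct non-null values per column
n_categories = toy[cat_cols].nunique()
name_len = [(name, int(n)) for name, n in n_categories.items()]

# Columns with more categories than the threshold are candidates to drop
dropcat_cols = [name for name, n in name_len if n > threshold]

print(name_len)
print(dropcat_cols)
```

Like `value_counts`, `nunique` ignores missing values by default, so the two approaches should agree on columns that contain NaN.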

II. Discretize numerical features

# Binning
def binning(col, cut_points, labels=None):
    '''
    Discretize a continuous feature into a few classes.
    '''
    # Define min and max values
    minval = col.min()
    maxval = col.max()
    # Build the list of bin edges from the minimum, the cut points and the maximum
    break_points = [minval] + cut_points + [maxval]
    # If no labels are given, use the default labels 0 ... (n-1)
    if labels is None:
        labels = list(range(len(cut_points) + 1))
    # Bin with the pandas cut function
    colbin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colbin
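To make the binning step concrete, here is a small worked example of the same `pd.cut` call. It is only a sketch: the `ages` Series, the cut points 30 and 60, and the label names are invented for illustration.

```python
import pandas as pd

# Hypothetical numeric feature, for illustration only
ages = pd.Series([22, 35, 47, 58, 63, 71])
cut_points = [30, 60]
labels = ['low', 'medium', 'high']

# Bin edges: minimum, the cut points, maximum
break_points = [ages.min()] + cut_points + [ages.max()]

# include_lowest=True makes the first bin closed on the left,
# so the minimum value itself is not left unbinned
binned = pd.cut(ages, bins=break_points, labels=labels, include_lowest=True)

# Number of rows per bin, in bin order
counts = binned.value_counts(sort=False)
print(binned.tolist())
print(counts)
```

`pd.cut` needs strictly increasing bin edges, so the cut points should lie strictly between the column's minimum and maximum.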

III. Remove features in which most rows share the same value

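def drophighvalueoffeacture(data, feacture_cols, prob_highvalue=0.95):
    '''
    Find the features in which one value accounts for the vast
    majority (highvalue) of rows.
    Input:  data: the whole dataset, including index and target
            feacture_cols: list of categorical feature names
            prob_highvalue: share at or above which a feature is dropped
    Output: newdata: DataFrame with the count and share of the most
                     frequent value of each feature
            dropfeacture_cols: list of feature names to drop
    '''
    # Count the occurrences of the most frequent value in one feature
    def num_higevalue(x):
        return max(x.value_counts())

    # One row per feature, holding the count of its most frequent value
    newdata = data[feacture_cols].apply(num_higevalue).to_frame('num_higevalue')
    nexample = data.shape[0]
    probvalue = [round(float(x) / nexample, 4) for x in newdata['num_higevalue']]
    newdata['probhighvalue'] = probvalue
    # Find the features whose share is at least prob_highvalue
    dropfeacture = newdata[newdata['probhighvalue'] >= prob_highvalue]
    dropfeacture_cols = list(dropfeacture.index)
    return newdata, dropfeacture_cols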
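The same check can be sketched more compactly with `value_counts(normalize=True)`, which returns shares instead of counts. This is an alternative to the function above rather than the post's own method, and the toy DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'flag' is 0 in 19 of 20 rows
toy = pd.DataFrame({
    'flag': [0] * 19 + [1],
    'color': ['r', 'g', 'b', 'r'] * 5,
})
prob_highvalue = 0.95

def top_share(col):
    # Share of the most frequent value among the non-null rows
    return col.value_counts(normalize=True).max()

shares = toy.apply(top_share).round(4)

# Features whose dominant value reaches the threshold
dropfeacture_cols = list(shares[shares >= prob_highvalue].index)
print(shares)
print(dropfeacture_cols)
```

One difference to keep in mind: `value_counts(normalize=True)` divides by the number of non-null rows, whereas the function above divides by the total row count, so the two can disagree on columns with many missing values.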

IV. Handle time features

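from datetime import datetime, timedelta

def turntimetodayweekmonth(listinginfo):
    '''
    Convert dates such as 2014/03/05 into day, week and month.
    Input:  listinginfo: the time feature (a Series of date strings)
    Output: dayweekmonthoflist: DataFrame with columns day, week, month
    '''
    # Parse one date string by hand (pd.to_datetime below does this for the whole Series)
    def strtotime(x):
        cday = datetime.strptime(x, '%Y/%m/%d')
        return cday

    def timetodayweekmonth(x):
        day = x.strftime('%j')    # day of the year
        week = x.strftime('%w')   # day of the week (0 = Sunday)
        month = x.strftime('%m')  # month
        return Series([day, week, month])

    dateoflisting = pd.to_datetime(listinginfo)
    # One row of (day, week, month) per date
    dayweekmonthoflist = dateoflisting.apply(timetodayweekmonth)
    dayweekmonthoflist = DataFrame(dayweekmonthoflist.values,
                                   columns=['day', 'week', 'month'])
    return dayweekmonthoflist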
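For comparison, the same three fields can be read from the pandas `.dt` accessor without calling `strftime` row by row. The sketch below is an alternative under stated assumptions, not the post's own code: the two sample dates are made up, and the results are integers rather than the zero-padded strings that `strftime` produces.

```python
import pandas as pd

# Hypothetical sample dates in the 2014/03/05 format
listinginfo = pd.Series(['2014/03/05', '2014/03/09'])

# Parse with an explicit format so malformed dates fail loudly
dates = pd.to_datetime(listinginfo, format='%Y/%m/%d')

result = pd.DataFrame({
    'day': dates.dt.dayofyear,    # day of the year, 1-based
    'week': dates.dt.dayofweek,   # Monday = 0 ... Sunday = 6
    'month': dates.dt.month,      # 1 ... 12
})
print(result)
```

Note that `dt.dayofweek` numbers Monday as 0, while `strftime('%w')` numbers Sunday as 0, so the week column is not interchangeable between the two versions.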
