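Notes: converting text labels in a Python dataset to numeric values

In data science, labels are often stored as text. They need to be converted to numeric values before further processing.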

Reference 1:

get_dummies is the pandas way of doing one-hot encoding. See the official pandas documentation for the full list of parameters.

[Figure: the DataFrame before get_dummies]

[Figure: the DataFrame after get_dummies]

Alternatively, encode a single column with pd.get_dummies(df.color) and join the result back onto the DataFrame:

df = df.join(pd.get_dummies(df.color))
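To make the before and after concrete, here is a minimal sketch of the join shown above. The toy data and the "price" column are made up for illustration; only the "color" column name comes from the snippet. Passing dtype=int is a choice made here so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data; only the "color" column name comes from the text above.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10, 20, 30, 40],
})

# One indicator column per distinct value, in sorted order: blue, green, red.
dummies = pd.get_dummies(df["color"], dtype=int)

# join aligns on the index, so each row keeps its own indicator columns.
encoded = df.join(dummies)
print(encoded)
```

Each row of the indicator columns contains exactly one 1, in the column that matches that row's color.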

Reference 2:

Using pandas Categorical()

import pandas as pd

c = ['a','a','a','b','b','c','c','c','c']

category = pd.Categorical(c)

# prints [0 0 0 1 1 2 2 2 2]
print(category.codes)
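By default the codes follow the sorted order of the distinct values. If the labels have a natural order, the categories can be passed explicitly, as in the sketch below. The size labels are made up for illustration. A value that is not among the listed categories gets the code -1.

```python
import pandas as pd

# Hypothetical labels; "huge" is deliberately not in the category list.
sizes = ["medium", "small", "large", "small", "huge"]

# Fix the code order explicitly: small -> 0, medium -> 1, large -> 2.
category = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)

codes = category.codes.tolist()
print(codes)

# Map the valid codes back to their text labels.
valid = [code for code in codes if code >= 0]
restored = [category.categories[code] for code in valid]
print(restored)
```

Checking for -1 before further processing is a cheap way to catch labels that were misspelled or never seen before.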

Reference 3:

Using sklearn

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit([1,5,67,100])

le.transform([1,1,100,67,5])

Output: array([0, 0, 3, 2, 1])
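The same encoder also works on text labels, which is the case this note is about. The sketch below uses made-up animal labels. It shows that the learned classes are stored in sorted order and that inverse_transform recovers the original text.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels for illustration.
labels = ["cat", "dog", "bird", "dog", "cat"]

le = LabelEncoder()
encoded = le.fit_transform(labels)  # fit and transform in one step

# classes_ holds the distinct labels in sorted order; a label's code is its index there.
print(le.classes_.tolist())
print(encoded.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(encoded).tolist()
print(decoded)
```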

# OneHotEncoder expands a categorical column into one column per category:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

ohe.fit([[1],[2],[3],[4]])

ohe.transform([[2],[3],[1],[4]]).toarray()

Output: [[0,1,0,0], [0,0,1,0], [1,0,0,0], [0,0,0,1]]
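OneHotEncoder accepts text categories directly, as long as the input is 2-D (one inner list per sample). The sketch below uses made-up color values. Setting handle_unknown="ignore" is a choice made here: it makes a category that was not seen during fit encode as an all-zero row instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training values; each sample is a one-element row.
train = [["red"], ["green"], ["blue"]]

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# The learned categories are sorted: blue, green, red.
print(ohe.categories_[0].tolist())

# "purple" was not seen during fit, so its row is all zeros.
rows = ohe.transform([["green"], ["red"], ["purple"]]).toarray()
print(rows)
```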

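Reference 4:

Using the keras.utils.to_categorical function in Keras

to_categorical(y, num_classes=None, dtype='float32')

Converts integer labels to one-hot vectors. y is an array of ints, and num_classes is the total number of label classes, which must be greater than max(y) because labels start at 0.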
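To show what this conversion computes without requiring Keras, here is a NumPy sketch of the same idea. The one_hot helper is hypothetical and is not Keras's own implementation. It assumes y is a flat sequence of non-negative integers and, when num_classes is omitted, uses max(y) + 1.

```python
import numpy as np

def one_hot(y, num_classes=None, dtype="float32"):
    """Hypothetical NumPy stand-in for the conversion described above."""
    y = np.asarray(y, dtype="int64").ravel()
    if num_classes is None:
        # Labels start at 0, so the class count is the largest label plus one.
        num_classes = int(y.max()) + 1
    out = np.zeros((y.size, num_classes), dtype=dtype)
    # Set a single 1 per row, in the column given by that row's label.
    out[np.arange(y.size), y] = 1
    return out

encoded = one_hot([0, 2, 1, 2])
print(encoded)

# Reserving extra classes widens the output with all-zero columns.
padded = one_hot([0, 1], num_classes=4)
print(padded.shape)
```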
