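Sometimes you need to turn feature names (column headers) into the values of a variable, that is, reshape a dataset from wide to long format, loosely a kind of transpose. A typical scenario looks like this: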
# Dataset
In [5]: test
Out[5]:
             tweet_id  doggo  floofer  pupper  puppo
0  675003128568291329   none     none    none   none
1  786233965241827333   none     none    none   none
2  683481228088049664   none     none  pupper   none
3  675497103322386432   none     none    none   none
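To follow along, the sample above can be rebuilt by hand. This is a minimal sketch that assumes `none` is a plain string placeholder rather than a real missing value, and that `tweet_id` is stored as an integer; neither detail is stated in the original.

```python
import pandas as pd

# Rebuild the four-row sample shown above.
# Assumption: 'none' is an ordinary string, not None/NaN.
test = pd.DataFrame({
    "tweet_id": [
        675003128568291329,
        786233965241827333,
        683481228088049664,
        675497103322386432,
    ],
    "doggo": ["none"] * 4,
    "floofer": ["none"] * 4,
    "pupper": ["none", "none", "pupper", "none"],
    "puppo": ["none"] * 4,
})
print(test)
```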
# Set the index first, then use .stack() to go from wide to long, and name the resulting feature
In [6]: s1 = test.set_index('tweet_id').stack().rename('stage')
In [7]: s1
Out[7]:
tweet_id
675003128568291329  doggo        none
                    floofer      none
                    pupper       none
                    puppo        none
786233965241827333  doggo        none
                    floofer      none
                    pupper       none
                    puppo        none
683481228088049664  doggo        none
                    floofer      none
                    pupper     pupper
                    puppo        none
675497103322386432  doggo        none
                    floofer      none
                    pupper       none
                    puppo        none
Name: stage, dtype: object
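What `.stack()` does is easier to see on a tiny frame: the column labels move into an inner index level, so each (row, column) cell becomes one entry of a Series. The frame below is hypothetical and only serves as an illustration; `unstack()` is shown as the reverse operation.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the reshape.
df = pd.DataFrame(
    {"a": ["x", "y"], "b": ["z", "w"]},
    index=pd.Index([1, 2], name="id"),
)

stacked = df.stack()          # Series indexed by (id, former column label)
print(stacked)
print(stacked.index.nlevels)  # number of index levels after stacking

restored = stacked.unstack()  # move the inner level back out to columns
print(restored)
```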
# Reset the MultiIndex
In [8]: s2 = s1.reset_index()
In [9]: s2
Out[9]:
              tweet_id  level_1   stage
0   675003128568291329    doggo    none
1   675003128568291329  floofer    none
2   675003128568291329   pupper    none
3   675003128568291329    puppo    none
4   786233965241827333    doggo    none
5   786233965241827333  floofer    none
6   786233965241827333   pupper    none
7   786233965241827333    puppo    none
8   683481228088049664    doggo    none
9   683481228088049664  floofer    none
10  683481228088049664   pupper  pupper
11  683481228088049664    puppo    none
12  675497103322386432    doggo    none
13  675497103322386432  floofer    none
14  675497103322386432   pupper    none
15  675497103322386432    puppo    none
# Drop the level_1 column and keep only the rows whose stage is not 'none'
In [10]: s2.drop(['level_1'], axis=1, inplace=True)
In [11]: s3 = s2[s2.stage != 'none']
In [12]: s3
Out[12]:
              tweet_id   stage
10  683481228088049664  pupper
# Merge back into the original dataset
In [14]: result = pd.merge(test, s3, how='left', on='tweet_id')
In [15]: result
Out[15]:
             tweet_id  doggo  floofer  pupper  puppo   stage
0  675003128568291329   none     none    none   none     NaN
1  786233965241827333   none     none    none   none     NaN
2  683481228088049664   none     none  pupper   none  pupper
3  675497103322386432   none     none    none   none     NaN
# Drop the intermediate features to get the final result
In [16]: result.drop(['doggo','floofer','pupper','puppo'], axis=1)
Out[16]:
             tweet_id   stage
0  675003128568291329     NaN
1  786233965241827333     NaN
2  683481228088049664  pupper
3  675497103322386432     NaN
In [17]: test
Out[17]:
             tweet_id  doggo  floofer  pupper  puppo
0  675003128568291329   none     none    none   none
1  786233965241827333   none     none    none   none
2  683481228088049664   none     none  pupper   none
3  675497103322386432   none     none    none   none
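The steps of the session can be collected into one function. This is only a sketch: the name `collapse_stage` and its parameters are made up for illustration, and it assumes the placeholder is the string `'none'` as in the sample.

```python
import pandas as pd

def collapse_stage(df, id_col="tweet_id", placeholder="none"):
    """Collapse one-column-per-stage data into a single 'stage' column."""
    stage = (
        df.set_index(id_col)
          .stack()                  # wide -> long: one row per (id, stage column)
          .rename("stage")
          .reset_index()
          .drop(columns="level_1")  # the former column labels are not needed
    )
    # Keep only the rows that carry a real stage value.
    stage = stage[stage["stage"] != placeholder]
    # Left merge keeps ids that have no stage; their 'stage' becomes NaN.
    return df[[id_col]].merge(stage, how="left", on=id_col)

# Same sample data as in the session above.
test = pd.DataFrame({
    "tweet_id": [675003128568291329, 786233965241827333,
                 683481228088049664, 675497103322386432],
    "doggo": ["none"] * 4,
    "floofer": ["none"] * 4,
    "pupper": ["none", "none", "pupper", "none"],
    "puppo": ["none"] * 4,
})

result = collapse_stage(test)
print(result)
```

The function returns a new frame and leaves `test` untouched. One caveat: a `tweet_id` that has more than one stage set would appear on several rows after the merge, a case the four-row sample does not exercise.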
There is probably a simpler way to do this; I will add it later.
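One candidate for a shorter route is `DataFrame.melt`, which goes from wide to long without touching the index. The sketch below is not the author's method, just one possible alternative under the same assumption that `'none'` is a string placeholder.

```python
import pandas as pd

# Same sample data as in the session above.
test = pd.DataFrame({
    "tweet_id": [675003128568291329, 786233965241827333,
                 683481228088049664, 675497103322386432],
    "doggo": ["none"] * 4,
    "floofer": ["none"] * 4,
    "pupper": ["none", "none", "pupper", "none"],
    "puppo": ["none"] * 4,
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# melt yields the columns: tweet_id, variable (old column name), stage (cell value).
long = test.melt(id_vars="tweet_id", value_vars=stage_cols, value_name="stage")
long = long[long["stage"] != "none"]

# Attach the surviving stage values to the full list of ids.
result = test[["tweet_id"]].merge(
    long[["tweet_id", "stage"]], how="left", on="tweet_id"
)
print(result)
```

This replaces the `set_index` / `stack` / `reset_index` / `drop` sequence with a single `melt` call; the filtering and the left merge stay the same.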