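pandas duplicated() flags duplicate records: it returns a boolean Series in which True marks a row that repeats an earlier one.
pandas drop_duplicates() removes duplicate records; duplicates can be judged on specific columns or on all columns.
numpy unique() returns all distinct values, sorted in ascending order.
set(), a Python built-in, also returns the collection of unique elements, though a set has no guaranteed order.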
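The pandas functions are demonstrated in the example that follows. As a quick sketch of the other two, the snippet below applies numpy.unique() and set() to a small made-up list (the list `values` is an illustration, not data from this article):

```python
import numpy as np

values = [3, 1, 2, 3, 1]

# numpy.unique returns the distinct values as a sorted array
unique_sorted = np.unique(values)
print(unique_sorted)  # [1 2 3]

# set() also removes duplicates, but a set has no guaranteed order,
# so sort it explicitly when order matters
unique_set = set(values)
print(sorted(unique_set))  # [1, 2, 3]
```

If the order of the result matters, numpy.unique() is the safer choice of the two, since it always sorts.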
Example: handling duplicate values
import pandas as pd

data1 = ['a', 1]
data2 = ['a', 1]
data3 = ['b', 2]
data4 = ['b', 2]
data = pd.DataFrame([data1, data2, data3, data4], columns=['col1', 'col2'])
print(data)

# Flag duplicates: True marks a row that repeats an earlier row
isduplicated = data.duplicated()
print(isduplicated)

# Drop duplicates
new_1 = data.drop_duplicates()                  # compare all columns
new_2 = data.drop_duplicates(['col1'])          # compare col1 only
new_3 = data.drop_duplicates(['col1', 'col2'])  # compare col1 and col2
print(new_1)
print(new_2)
print(new_3)
Result:
  col1  col2
0    a     1
1    a     1
2    b     2
3    b     2
0    False
1     True
2    False
3     True
dtype: bool
  col1  col2
0    a     1
2    b     2
  col1  col2
0    a     1
2    b     2
  col1  col2
0    a     1
2    b     2
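The example above always keeps the first occurrence of each duplicate. Both duplicated() and drop_duplicates() also accept a `keep` parameter that changes which occurrence is treated as the original. A minimal sketch on the same four rows (the variable name `last_kept` is only illustrative):

```python
import pandas as pd

data = pd.DataFrame(
    [['a', 1], ['a', 1], ['b', 2], ['b', 2]],
    columns=['col1', 'col2'],
)

# keep='first' (the default) marks every occurrence after the first
print(data.duplicated().tolist())             # [False, True, False, True]
# keep='last' marks every occurrence before the last
print(data.duplicated(keep='last').tolist())  # [True, False, True, False]
# keep=False marks every row that has a duplicate anywhere
print(data.duplicated(keep=False).tolist())   # [True, True, True, True]

# Keep the last occurrence of each duplicated row instead of the first
last_kept = data.drop_duplicates(keep='last')
print(last_kept.index.tolist())  # [1, 3]
```

Note that drop_duplicates() preserves the original index labels; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.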
Example: data cleaning
import re  # regular expression library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Feature engineering
train_df_org = pd.read_csv('train.csv')
test_df_org = pd.read_csv('test.csv')
test_df_org['survived'] = 0  # the test set gets a placeholder value in this column
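# Assumption: the source uses combined_train_test without defining it.
# A row-wise concat of train and test matches the 1309 rows counted in the result.
combined_train_test = pd.concat([train_df_org, test_df_org], ignore_index=True)

# --- pclass column --- build a pclass/fare category
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    # df is a single row; compare its fare with the mean fare of its own class
    if df['pclass'] == 1:
        if df['fare'] <= pclass1_mean_fare:
            return 'pclass1_low'
        else:
            return 'pclass1_high'
    elif df['pclass'] == 2:
        if df['fare'] <= pclass2_mean_fare:
            return 'pclass2_low'
        else:
            return 'pclass2_high'
    elif df['pclass'] == 3:
        if df['fare'] <= pclass3_mean_fare:
            return 'pclass3_low'
        else:
            return 'pclass3_high'

# Mean fare per cabin class
mean_fare_by_pclass = combined_train_test.groupby('pclass')['fare'].mean()
pclass1_mean_fare = mean_fare_by_pclass.loc[1]  # mean fare for pclass = 1
pclass2_mean_fare = mean_fare_by_pclass.loc[2]
pclass3_mean_fare = mean_fare_by_pclass.loc[3]

# Assumption: this step is missing from the source, but the groupby below needs the column
combined_train_test['pclass_fare_category'] = combined_train_test.apply(
    pclass_fare_category,
    args=(pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare),
    axis=1,
)

print('# pclass_fare_category...')
print(combined_train_test.groupby(['pclass_fare_category', 'survived'])['survived'].count())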
Result:
# pclass_fare_category...
pclass_fare_category  survived
pclass1_high          0            49
                      1            48
pclass1_low           0           138
                      1            88
pclass2_high          0            68
                      1            43
pclass2_low           0           122
                      1            44
pclass3_high          0           174
                      1            42
pclass3_low           0           416
                      1            77
Name: survived, dtype: int64
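The category above is assigned by comparing each row's fare with the mean fare of its class. As an alternative sketch (not the method used in this article), the same rule can be written without a row-wise function by using groupby().transform('mean') and numpy.where(). The rows below are made up for illustration and are not taken from train.csv or test.csv; the names `class_mean` and `level` are likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'fare': [100.0, 20.0, 30.0, 10.0, 5.0, 15.0],
})

# Mean fare of each row's own class, aligned to the original rows
class_mean = df.groupby('pclass')['fare'].transform('mean')

# 'low' when the fare is at or below the class mean, otherwise 'high'
level = np.where(df['fare'] <= class_mean, 'low', 'high')

# Build labels such as 'pclass1_low' from the class number and the level
df['pclass_fare_category'] = [
    f'pclass{p}_{l}' for p, l in zip(df['pclass'], level)
]

print(df)
```

This version handles any number of classes without adding more if/elif branches, at the cost of being less explicit than the function above.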