Python Data Analysis Function Cheat Sheet

Sample data:

import numpy as np
import pandas as pd

adults = pd.read_csv('data/adults.txt')

adults.head()

   age         workclass  final_weight  education  education_num      marital_status         occupation   relationship   race     sex  capital_gain  capital_loss  hours_per_week  native_country  salary
0   39         state-gov         77516  bachelors             13       never-married       adm-clerical  not-in-family  white    male          2174             0              40   united-states   <=50k
1   50  self-emp-not-inc         83311  bachelors             13  married-civ-spouse    exec-managerial        husband  white    male             0             0              13   united-states   <=50k
2   38           private        215646    hs-grad              9            divorced  handlers-cleaners  not-in-family  white    male             0             0              40   united-states   <=50k
3   53           private        234721       11th              7  married-civ-spouse  handlers-cleaners        husband  black    male             0             0              40   united-states   <=50k
4   28           private        338409  bachelors             13  married-civ-spouse     prof-specialty           wife  black  female             0             0              40            cuba   <=50k

x=adults[['age','education','occupation','hours_per_week']].copy()

y=adults['salary'].copy()

x.education.unique()  # returns the unique values of the column

array(['bachelors', 'hs-grad', '11th', 'masters', '9th', 'some-college', 'assoc-acdm', 'assoc-voc', '7th-8th', 'doctorate', 'prof-school', '5th-6th', '10th', '1st-4th', 'preschool', '12th'], dtype=object)

np.argwhere(x.education.unique()=='masters')  # returns the position of the value within the array of unique values

array([[3]])    # note: the result is a 2-D array; extract the value with [0][0] or [0, 0]

This approach, which replaces each string value with its numeric position, is usually good enough for simple analysis.
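The position lookup above can be extended to encode the whole column. The following is a minimal sketch, assuming a small made-up Series in place of x.education; the names `education`, `lookup`, and `education_code` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x.education; the values are illustrative only.
# dtype=object mirrors the dtype shown in the unique() output above.
education = pd.Series(
    ['bachelors', 'hs-grad', '11th', 'masters', 'hs-grad'], dtype=object
)

levels = education.unique()  # unique values in order of first appearance

# Position of a single value, as in the np.argwhere call above
masters_pos = int(np.argwhere(levels == 'masters')[0, 0])

# Encode the whole column by looking up each value's position
lookup = {level: pos for pos, level in enumerate(levels)}
education_code = education.map(lookup)

# pd.factorize yields the same integer codes in a single call
codes, uniques = pd.factorize(education)

print(masters_pos)
print(education_code.tolist())
```

Integer codes like these imply an ordering between categories that may not exist, which is one reason to prefer one-hot encoding for modelling.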

A better approach is one-hot encoding, as follows:

pd.get_dummies

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
Convert categorical variable into dummy/indicator variables.

In other words, it turns a categorical variable into indicator vectors.

Examples
--------
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0
>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
>>> pd.get_dummies(pd.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

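# for a single Series, prefix should be a plain string, not a list
edu_dummies = pd.get_dummies(x.education, prefix='edu')

x_edu_dummies = pd.concat([x, edu_dummies], axis=1)

# drop() returns a new DataFrame, so assign the result back
x_edu_dummies = x_edu_dummies.drop('education', axis=1)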
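The three steps above (encode, concatenate, drop) can be tried on a tiny frame. This is a minimal sketch, assuming a made-up `x_demo` frame instead of the real adults data; `dtype=int` is passed explicitly because newer pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Hypothetical miniature of x; the rows are illustrative only
x_demo = pd.DataFrame({
    'age': [39, 50, 38],
    'education': ['bachelors', 'bachelors', 'hs-grad'],
})

# dtype=int keeps 0/1 output regardless of the pandas version's default
edu_dummies = pd.get_dummies(x_demo['education'], prefix='edu', dtype=int)

# Append the indicator columns, then drop the original string column
x_encoded = pd.concat([x_demo, edu_dummies], axis=1).drop('education', axis=1)

print(x_encoded.columns.tolist())
```

Each category becomes its own column named prefix + '_' + category, so no artificial ordering is introduced between education levels.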

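Like Python's built-in map(), the pandas map() method takes a function, a dictionary, or another object that accepts a single input value, applies it to every element of a single column in turn, and returns the results.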

# define the mapping dictionary: f -> female, m -> male

gender2xb = {'f': 'female', 'm': 'male'}

# use map() to get the mapped column for the gender column

data.gender.map(gender2xb)

A lambda function works as well:

# the gender column is known to contain only f and m, so the following lambda is enough

data.gender.map(lambda x: 'female' if x == 'f' else 'male')

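You can also define a named function:

def gender_to_xb(x):
    # compare strings with ==, not `is`, which tests object identity
    return 'female' if x == 'f' else 'male'

data.gender.map(gender_to_xb)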

A string's format method can be passed in too:

data.gender.map("this kid's gender is {}".format)
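The map() variants above can be compared side by side. This is a minimal sketch, assuming a four-row made-up `data` frame; the real dataset in the post is not shown, so the values here are illustrative only.

```python
import pandas as pd

# Hypothetical frame standing in for `data`
data = pd.DataFrame({'gender': ['f', 'm', 'm', 'f']})

gender2xb = {'f': 'female', 'm': 'male'}

by_dict = data.gender.map(gender2xb)
by_lambda = data.gender.map(lambda g: 'female' if g == 'f' else 'male')
by_format = data.gender.map("this kid's gender is {}".format)

# A value missing from the dictionary maps to NaN, whereas the lambda
# labels anything that is not 'f' as 'male'
unknown = pd.Series(['x']).map(gender2xb)

print(by_dict.tolist())
print(by_format.tolist())
```

The dictionary form is the safer default when the column may contain unexpected values, since unmapped entries surface as NaN instead of being mislabelled.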

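When the applied function returns several values, the result is a single column in which each element is a tuple of the return values. To get each return value as its own column, use zip(*zipped) to unpack the sequence of tuples: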

print(a[:10])

print(b[:10])
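The code that produced a and b is not included above, so the following is only a sketch of the pattern under stated assumptions: it uses DataFrame.apply with axis=1 on a made-up `data` frame, and the column names `name` and `gender` are hypothetical.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration
data = pd.DataFrame({'name': ['Mary', 'John'], 'gender': ['f', 'm']})

# Each row yields a tuple, so apply() returns a single Series of tuples
zipped = data.apply(
    lambda row: (row['name'].lower(), row['gender'].upper()), axis=1
)

# zip(*zipped) transposes the sequence of tuples into one tuple per output
a, b = zip(*zipped)

print(a[:10])
print(b[:10])
```

After unpacking, a and b are plain tuples, so they can be assigned directly to new columns of the frame if needed.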

For instance, in the simple example below, we lowercase every string value in the baby-names data and return values of any other type unchanged:
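The example itself is missing from the text, so here is a minimal sketch of what such an element-wise transform could look like. The `names` frame and the helper `lower_all_string` are hypothetical; applying Series.map column by column is used because it behaves the same across pandas versions, while DataFrame.applymap (superseded by DataFrame.map in recent releases) does the same in one call.

```python
import pandas as pd

# Hypothetical slice of a baby-names table; the values are illustrative only
names = pd.DataFrame({
    'name': ['Mary', 'JOHN'],
    'gender': ['F', 'M'],
    'count': [10, 20],
})

def lower_all_string(value):
    """Lowercase strings; return any other type unchanged."""
    return value.lower() if isinstance(value, str) else value

# Apply the function to every element, one column at a time
lowered = names.apply(lambda col: col.map(lower_all_string))

print(lowered)
```

The isinstance check is what lets the numeric count column pass through untouched.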
