groupby agg: custom aggregation functions

2021-10-19 18:00:55

import pandas as pd

college = pd.read_csv('data/college.csv')

college.head()

   instnm                               city        stabbr  hbcu  menonly  womenonly  relaffil  satvrmid  satmtmid  distanceonly  ...
0  alabama a & m university             normal      al      1.0   0.0      0.0        0         424.0     420.0     0.0           ...
1  university of alabama at birmingham  birmingham  al      0.0   0.0      0.0        0         570.0     565.0     0.0           ...
2  amridge university                   montgomery  al      0.0   0.0      0.0        1         nan       nan       1.0           ...
3  university of alabama in huntsville  huntsville  al      0.0   0.0      0.0        0         595.0     590.0     0.0           ...
4  alabama state university             montgomery  al      1.0   0.0      0.0        0         425.0     430.0     0.0           ...

   ...  ugds_2mor  ugds_nra  ugds_unkn  pptug_ef  curroper  pctpell  pctfloan  ug25abv  md_earn_wne_p10  grad_debt_mdn_supp
0  ...  0.0000     0.0059    0.0138     0.0656    1         0.7356   0.8284    0.1049   30300            33888
1  ...  0.0368     0.0179    0.0100     0.2607    1         0.3460   0.5214    0.2422   39700            21941.5
2  ...  0.0000     0.0000    0.2715     0.4536    1         0.6801   0.7795    0.8540   40100            23370
3  ...  0.0172     0.0332    0.0350     0.2146    1         0.3072   0.4596    0.2640   45500            24097
4  ...  0.0098     0.0243    0.0137     0.0892    1         0.7347   0.7554    0.1270   26600            33118.5

5 rows × 27 columns

# Compute the mean and standard deviation of the undergraduate population (ugds) for each state

college.groupby('stabbr')['ugds'].agg(['mean', 'std']).round(0).head()

          mean      std
stabbr
ak      2493.0   4052.0
al      2790.0   4658.0
ar      1644.0   3143.0
as      1276.0      nan
az      4130.0  14894.0

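Next, write a custom function that finds the largest number of standard deviations by which any value in a group lies from the group mean.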

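# Largest number of standard deviations from the mean (the maximum absolute z-score)

def max_deviation(s):
    # Standardize each value: subtract the mean, divide by the standard deviation
    std_score = (s - s.mean()) / s.std()
    # Return the largest absolute z-score
    return std_score.abs().max()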
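To make the return value concrete, here is a minimal sketch that runs `max_deviation` on a hand-made Series. The values are made up for illustration and do not come from the college data set. Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

def max_deviation(s):
    # z-score every value, then return the largest absolute z-score
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy values: the mean is 4.0 and the sample variance is 12.5
toy = pd.Series([1, 2, 3, 4, 10])

# The value 10 lies 6 units above the mean, so the result is 6 / sqrt(12.5)
result = max_deviation(toy)
print(round(result, 3))  # 1.697
```

The function reduces a whole Series to a single number, which is exactly what `agg` expects from an aggregation function.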

# Pass the custom function itself to agg: use the bare function name, without quotes and without calling it

college.groupby('stabbr')['ugds'].agg(max_deviation).round(1).head()

'''stabbr
ak    2.6
al    5.8
ar    6.3
as    nan
az    9.9
name: ugds, dtype: float64
'''
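The `nan` for `as` is consistent with a group that has only one non-missing value: the sample standard deviation of a single value is undefined, so every z-score in that group is NaN. This is a plausible explanation rather than something the source confirms. The sketch below reproduces the effect on made-up data; the column names `key` and `val` are hypothetical.

```python
import pandas as pd

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical data: group 'a' has three rows, group 'b' has only one
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b'],
                   'val': [1.0, 2.0, 6.0, 5.0]})

# std() of a single value is NaN (ddof=1), so group 'b' aggregates to NaN
out = df.groupby('key')['val'].agg(max_deviation)
print(out)
```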

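# A custom aggregation function also works on several numeric columns at once
# (select them with a list, i.e. double brackets; the tuple-style selection is no longer supported in recent pandas)

college.groupby('stabbr')[['ugds', 'satvrmid', 'satmtmid']].agg(max_deviation).round(1).head()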

        ugds  satvrmid  satmtmid
stabbr
ak       2.6       nan       nan
al       5.8       1.6       1.8
ar       6.3       2.2       2.3
as       nan       nan       nan
az       9.9       1.9       1.4
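The multi-column case can be sketched on a small made-up frame. The column names `state`, `enroll`, and `score` are hypothetical stand-ins for `stabbr`, `ugds`, and the SAT columns; the point is only that `agg` calls the function once per column per group.

```python
import pandas as pd

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data with two groups and two numeric columns
df = pd.DataFrame({
    'state':  ['x', 'x', 'x', 'y', 'y', 'y'],
    'enroll': [10, 20, 60, 5, 5, 20],
    'score':  [400.0, 500.0, 600.0, 450.0, 550.0, 650.0],
})

# A list of column names (double brackets) selects several columns;
# max_deviation then receives each column of each group as a Series
result = df.groupby('state')[['enroll', 'score']].agg(max_deviation)
print(result.round(2))
```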

# A custom aggregation function can be combined with built-in aggregations given by name

college.groupby(['stabbr', 'relaffil'])[['ugds', 'satvrmid', 'satmtmid']] \
    .agg([max_deviation, 'mean', 'std']).round(1).head()

                          ugds                       satvrmid                     satmtmid
                 max_deviation    mean     std  max_deviation   mean   std  max_deviation   mean   std
stabbr relaffil
ak     0                   2.1  3508.9  4539.5            nan    nan   nan            nan    nan   nan
       1                   1.1   123.3   132.9            nan  555.0   nan            nan  503.0   nan
al     0                   5.2  3248.8  5102.4            1.6  514.9  56.5            1.7  515.8  56.7
       1                   2.4   979.7   870.8            1.5  498.0  53.0            1.4  485.6  61.4
ar     0                   5.8  1793.7  3401.6            1.9  481.1  37.9            2.0  503.6  39.0
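Passing a list of aggregations produces two-level column labels: the outer level is the source column, the inner level is the aggregation name. The following sketch, on made-up data with hypothetical column names, shows the resulting labels and how to pick out a single statistic.

```python
import pandas as pd

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data
df = pd.DataFrame({
    'state':  ['x', 'x', 'x', 'y', 'y', 'y'],
    'enroll': [10, 20, 60, 5, 5, 20],
    'score':  [400.0, 500.0, 600.0, 450.0, 550.0, 650.0],
})

# Mix the custom function with built-in aggregations named by string
result = df.groupby('state')[['enroll', 'score']].agg([max_deviation, 'mean', 'std'])

# Each column label is a (source column, aggregation name) tuple
print(result.columns.tolist())

# Select one statistic with a tuple key
score_mean = result[('score', 'mean')]
print(score_mean)
```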

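# pandas uses the function's name as the label of the resulting column; you can change it afterwards with the rename method, or beforehand by setting the function's __name__ attribute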

max_deviation.__name__

#'max_deviation'

max_deviation.__name__ = 'max deviation'

college.groupby(['stabbr', 'relaffil'])[['ugds', 'satvrmid', 'satmtmid']] \
    .agg([max_deviation, 'mean', 'std']).round(1).head()

                          ugds                       satvrmid                     satmtmid
                 max deviation    mean     std  max deviation   mean   std  max deviation   mean   std
stabbr relaffil
ak     0                   2.1  3508.9  4539.5            nan    nan   nan            nan    nan   nan
       1                   1.1   123.3   132.9            nan  555.0   nan            nan  503.0   nan
al     0                   5.2  3248.8  5102.4            1.6  514.9  56.5            1.7  515.8  56.7
       1                   2.4   979.7   870.8            1.5  498.0  53.0            1.4  485.6  61.4
ar     0                   5.8  1793.7  3401.6            1.9  481.1  37.9            2.0  503.6  39.0
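Setting `__name__` changes the function object itself, so the new label follows it everywhere it is used afterwards. The sketch below shows two ways to relabel without touching the function: the `rename` route mentioned above, and named aggregation, which is an alternative not covered in the original text. The data and the names `max_dev` and `avg` are hypothetical.

```python
import pandas as pd

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data
df = pd.DataFrame({
    'state':  ['x', 'x', 'x', 'y', 'y', 'y'],
    'enroll': [10, 20, 60, 5, 5, 20],
    'score':  [400.0, 500.0, 600.0, 450.0, 550.0, 650.0],
})

# Option 1: rename the inner column level after aggregating
renamed = (df.groupby('state')[['enroll', 'score']]
             .agg([max_deviation, 'mean'])
             .rename(columns={'max_deviation': 'max deviation'}, level=1))
print(renamed.columns.tolist())

# Option 2: named aggregation on a single column; the keyword is the output label
named = df.groupby('state')['score'].agg(max_dev=max_deviation, avg='mean')
print(named)
```

Both options leave `max_deviation.__name__` unchanged, which can matter when the same function is reused in several reports.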
