import pandas as pd

college = pd.read_csv('data/college.csv')
college.head()
                                instnm        city  stabbr  hbcu  menonly  womenonly  relaffil  satvrmid  satmtmid  distanceonly  ...
0             alabama a & m university      normal      al   1.0      0.0        0.0         0     424.0     420.0           0.0  ...
1  university of alabama at birmingham  birmingham      al   0.0      0.0        0.0         0     570.0     565.0           0.0  ...
2                   amridge university  montgomery      al   0.0      0.0        0.0         1       NaN       NaN           1.0  ...
3  university of alabama in huntsville  huntsville      al   0.0      0.0        0.0         0     595.0     590.0           0.0  ...
4             alabama state university  montgomery      al   1.0      0.0        0.0         0     425.0     430.0           0.0  ...

   ugds_2mor  ugds_nra  ugds_unkn  pptug_ef  curroper  pctpell  pctfloan  ug25abv  md_earn_wne_p10  grad_debt_mdn_supp
0     0.0000    0.0059     0.0138    0.0656         1   0.7356    0.8284   0.1049            30300               33888
1     0.0368    0.0179     0.0100    0.2607         1   0.3460    0.5214   0.2422            39700             21941.5
2     0.0000    0.0000     0.2715    0.4536         1   0.6801    0.7795   0.8540            40100               23370
3     0.0172    0.0332     0.0350    0.2146         1   0.3072    0.4596   0.2640            45500               24097
4     0.0098    0.0243     0.0137    0.0892         1   0.7347    0.7554   0.1270            26600             33118.5

5 rows × 27 columns
# Compute the mean and standard deviation of undergraduate enrollment for each state
college.groupby('stabbr')['ugds'].agg(['mean', 'std']).round(0).head()
          mean      std
stabbr
ak      2493.0   4052.0
al      2790.0   4658.0
ar      1644.0   3143.0
as      1276.0      NaN
az      4130.0  14894.0
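The NaN in the std column for 'as' is most likely because that group contains only one non-missing ugds value: pandas computes the sample standard deviation (ddof=1), which is undefined for a single observation. A minimal sketch of this behaviour on made-up data follows; the frame `toy` and its values are hypothetical and are not taken from college.csv.

```python
import pandas as pd

# Hypothetical toy data: group 'aa' has three rows, group 'bb' only one
toy = pd.DataFrame({
    'stabbr': ['aa', 'aa', 'aa', 'bb'],
    'ugds': [100.0, 200.0, 300.0, 50.0],
})

# 'std' is the sample standard deviation (ddof=1),
# so a one-row group yields NaN while its mean is still defined
stats = toy.groupby('stabbr')['ugds'].agg(['mean', 'std'])
print(stats)
```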
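Next, find the largest number of standard deviations by which any single institution's enrollment differs from its state mean. This requires a custom function.
# Custom aggregation (z-score standardization): the largest absolute z-score within a group
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()
# agg accepts the custom function object directly; pass the function name without quotes
college.groupby('stabbr')['ugds'].agg(max_deviation).round(1).head()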
'''stabbr
ak    2.6
al    5.8
ar    6.3
as    NaN
az    9.9
Name: ugds, dtype: float64
'''
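To make the z-score logic easier to check by hand, here is a small sketch that applies the same function to made-up numbers. The frame `toy` and its values are hypothetical; only `max_deviation` comes from the text above.

```python
import pandas as pd

def max_deviation(s):
    # z-score of every value in the group, then the largest absolute one
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data, chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    'stabbr': ['aa', 'aa', 'aa', 'bb', 'bb', 'bb'],
    'ugds': [100.0, 200.0, 300.0, 10.0, 10.0, 40.0],
})

# Group 'aa': mean 200, sample std 100, z-scores -1, 0, 1
# Group 'bb': mean 20, sample std sqrt(300), largest deviation 20
result = toy.groupby('stabbr')['ugds'].agg(max_deviation)
print(result)
```

In group 'bb' the value 40 lies 20 units from the mean of 20, and the sample standard deviation is the square root of 300, so the result is 20 divided by that square root.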
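# A custom aggregation function can also be applied to several numeric columns at once
# (select the columns with a list; recent pandas versions reject the bare-tuple form ['a', 'b'])
college.groupby('stabbr')[['ugds', 'satvrmid', 'satmtmid']].agg(max_deviation).round(1).head()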
        ugds  satvrmid  satmtmid
stabbr
ak       2.6       NaN       NaN
al       5.8       1.6       1.8
ar       6.3       2.2       2.3
as       NaN       NaN       NaN
az       9.9       1.9       1.4
# A custom aggregation function can also be combined with built-in aggregations
college.groupby(['stabbr', 'relaffil'])[['ugds', 'satvrmid', 'satmtmid']]\
.agg([max_deviation, 'mean', 'std']).round(1).head()
                 ugds                           satvrmid                    satmtmid
                 max_deviation    mean     std  max_deviation   mean   std  max_deviation   mean   std
stabbr relaffil
ak     0                   2.1  3508.9  4539.5            NaN    NaN   NaN            NaN    NaN   NaN
       1                   1.1   123.3   132.9            NaN  555.0   NaN            NaN  503.0   NaN
al     0                   5.2  3248.8  5102.4            1.6  514.9  56.5            1.7  515.8  56.7
       1                   2.4   979.7   870.8            1.5  498.0  53.0            1.4  485.6  61.4
ar     0                   5.8  1793.7  3401.6            1.9  481.1  37.9            2.0  503.6  39.0
# pandas uses the function's name as the name of the returned column;
# you can change it afterwards with the rename method, or by setting the function's __name__ attribute
max_deviation.__name__
#'max_deviation'
max_deviation.__name__ = 'max deviation'
college.groupby(['stabbr', 'relaffil'])[['ugds', 'satvrmid', 'satmtmid']]\
.agg([max_deviation, 'mean', 'std']).round(1).head()
                 ugds                           satvrmid                    satmtmid
                 max deviation    mean     std  max deviation   mean   std  max deviation   mean   std
stabbr relaffil
ak     0                   2.1  3508.9  4539.5            NaN    NaN   NaN            NaN    NaN   NaN
       1                   1.1   123.3   132.9            NaN  555.0   NaN            NaN  503.0   NaN
al     0                   5.2  3248.8  5102.4            1.6  514.9  56.5            1.7  515.8  56.7
       1                   2.4   979.7   870.8            1.5  498.0  53.0            1.4  485.6  61.4
ar     0                   5.8  1793.7  3401.6            1.9  481.1  37.9            2.0  503.6  39.0
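The rename route mentioned in the comment can be sketched as follows. It leaves the function object untouched and relabels only the second level of the resulting column MultiIndex. The frame `toy` and its values are hypothetical; `DataFrame.rename` with `columns=` and `level=` is standard pandas.

```python
import pandas as pd

def max_deviation(s):
    # Largest absolute z-score within the group
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data standing in for college.csv
toy = pd.DataFrame({
    'stabbr': ['aa', 'aa', 'aa', 'bb', 'bb', 'bb'],
    'ugds': [100.0, 200.0, 300.0, 10.0, 10.0, 40.0],
    'satvrmid': [400.0, 500.0, 600.0, 450.0, 500.0, 550.0],
})

result = toy.groupby('stabbr')[['ugds', 'satvrmid']].agg([max_deviation, 'mean', 'std'])

# Relabel the inner column level only; the function's __name__ stays as it was
renamed = result.rename(columns={'max_deviation': 'max deviation'}, level=1)
print(renamed.round(1))
```

Compared with assigning to `__name__`, this keeps the change local to one result instead of altering the label in every later aggregation that uses the function.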