Python Data Analysis in Practice (Part 3)

Working with Wikipedia's all-time Olympic Games medals dataset.

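import pandas as pd

# Read the data: use the first column as the index, skip the first row,
# and take the second row as the column names
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

# Rename some of the columns.
# The rename mappings are missing from the source text; the ones below are a
# reconstruction that assumes raw headers such as '01 !', '01 !.1', '01 !.2'
# and matches the column names used later (gold, gold.1, gold.2, ...).
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)

names_ids = df.index.str.split(r'\s\(')  # split the index by '('

df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['id'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or id (take first 3 characters from that)

# Drop the Totals row
df = df.drop('Totals')
df.head()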
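The index clean-up above relies on the `.str` accessor twice: once to split each label, and once more to pick elements out of the resulting lists. A minimal sketch of that step in isolation follows; the two labels are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical index labels in the "Name (CODE)" shape the loading code expects
idx = pd.Index(["Aland (ALD)", "Borduria (BRD) [1]"])

# Split on "whitespace followed by an opening parenthesis"
parts = idx.str.split(r"\s\(")

names = parts.str[0]          # text before " (" is the country name
ids = parts.str[1].str[:3]    # first three characters after "(" are the code

print(list(names), list(ids))
```

Slicing to three characters is what discards the closing parenthesis and any trailing footnote markers.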

Task 1: Return the first row of the DataFrame

def answer_zero():
    return df.iloc[0]

iloc is required here: iloc indexes by integer position, whereas loc indexes by label.
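A small sketch of the difference, using a made-up three-row table (the labels and numbers are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy table, not the real medal data
toy = pd.DataFrame({"gold": [5, 2, 9]}, index=["Aland", "Borduria", "Cascadia"])

first_by_position = toy.iloc[0]       # row at integer position 0
first_by_label = toy.loc["Aland"]     # row whose index label is "Aland"

# Both expressions select the same row here, but only iloc keeps working
# if the labels change or the rows are re-sorted.
print(first_by_position.equals(first_by_label))
```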

Task 2: Return the name of the country that has won the most gold medals at the Summer Games

def answer_one():
    return df.gold.idxmax()

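At first I considered boolean indexing, df[df.gold == df.gold.max()].index[0], which is quite cumbersome. Checking the Series documentation turned up a much more direct approach: **idxmax()** and **idxmin()** return the index label of the largest and smallest value in a Series.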
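The two approaches can be compared side by side on a toy Series (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical gold-medal counts
gold = pd.Series([5, 2, 9], index=["Aland", "Borduria", "Cascadia"], name="gold")

via_mask = gold[gold == gold.max()].index[0]   # boolean-mask version
via_idxmax = gold.idxmax()                     # direct version
lowest = gold.idxmin()                         # label of the smallest value

print(via_mask, via_idxmax, lowest)
```

When several rows tie for the maximum, both versions return the first matching label.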

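Task 3: Return the name of the country with the largest difference between its summer and winter gold medal counts relative to its total gold medal count (considering only countries that have won at least one gold medal in both the Summer and the Winter Games)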

$$\frac{\text{summer gold} - \text{winter gold}}{\text{total gold}}$$

def answer_three():
    df1 = df[(df.gold > 0) & (df['gold.1'] > 0)]
    return abs((df1['gold.1'] - df1.gold) / df1['gold.2']).idxmax()
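The restriction to countries with gold in both seasons matters: without it, any country with no winter gold scores a ratio of exactly 1 and wins automatically. The sketch below shows this on invented numbers, assuming the same column naming as above (gold for summer, gold.1 for winter, gold.2 for the combined total).

```python
import pandas as pd

# Hypothetical medal counts
toy = pd.DataFrame(
    {"gold": [10, 4, 6], "gold.1": [0, 3, 1]},
    index=["Aland", "Borduria", "Cascadia"],
)
toy["gold.2"] = toy["gold"] + toy["gold.1"]

# Ratio over every country: Aland has no winter gold, so its ratio is 1.0
ratio_all = ((toy["gold"] - toy["gold.1"]) / toy["gold.2"]).abs()

# Each comparison needs its own parentheses because & binds tighter than >
both = toy[(toy["gold"] > 0) & (toy["gold.1"] > 0)]
ratio_both = ((both["gold"] - both["gold.1"]) / both["gold.2"]).abs()

print(ratio_all.idxmax(), ratio_both.idxmax())
```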

Task 4: Return a Series named "points"

$$\text{points} = \text{gold.2} \times 3 + \text{silver.2} \times 2 + \text{bronze.2} \times 1$$

def answer_four():
    return pd.Series(data=df['gold.2'] * 3 + df['silver.2'] * 2 + df['bronze.2'] * 1, name='points')

Working with the census dataset from the United States Census Bureau.

import pandas as pd

census_df = pd.read_csv('census.csv')

Task 1: Return the state with the most counties

def answer_five():
    # Rows where STNAME equals CTYNAME are state-level summary rows, so drop them
    return census_df[census_df.STNAME != census_df.CTYNAME].STNAME.value_counts().idxmax()

Task 2: Return the three most populous states when only the three most populous counties of each state are counted (2010 census)

def answer_six():
    census_df1 = census_df[census_df.STNAME != census_df.CTYNAME][
        ['STATE', 'STNAME', 'CTYNAME', 'CENSUS2010POP']].set_index('STATE')
    frames = []
    for i in range(56):
        try:
            # .loc[[code]] always returns a DataFrame, even for a single county
            frames.append(census_df1.loc[[i + 1]]
                          .sort_values(by='CENSUS2010POP', ascending=False)
                          .iloc[:3].reset_index())
        except KeyError:
            # state codes are not contiguous, so some values of i match nothing
            pass
    new_df = pd.concat(frames)
    return list(new_df.groupby('STNAME')['CENSUS2010POP'].sum()
                .sort_values(ascending=False).iloc[:3].index)
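The explicit loop over state codes can also be expressed with groupby alone. The following is an alternative sketch, not the approach used above; it runs on invented data and assumes only the STNAME, CTYNAME and CENSUS2010POP columns.

```python
import pandas as pd

# Hypothetical county populations for three made-up states
toy = pd.DataFrame({
    "STNAME": ["A", "A", "A", "A", "B", "B", "C"],
    "CTYNAME": ["a1", "a2", "a3", "a4", "b1", "b2", "c1"],
    "CENSUS2010POP": [10, 20, 30, 40, 50, 5, 60],
})

# Sort descending first, so head(3) keeps each state's three largest counties
top3 = toy.sort_values("CENSUS2010POP", ascending=False).groupby("STNAME").head(3)

# Sum those counties per state, then take the largest totals
state_totals = top3.groupby("STNAME")["CENSUS2010POP"].sum()
top_states = list(state_totals.nlargest(2).index)

print(top_states)
```

On the real data, `nlargest(3)` would give the three states the task asks for; the toy example uses 2 only because it has three states in total.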

Task 3: Return the county whose population changed the most between 2010 and 2015

def answer_seven():
    census_df1 = census_df[census_df.STNAME != census_df.CTYNAME][
        ['CTYNAME', 'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
         'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']].set_index('CTYNAME')
    # Change is measured as the largest yearly estimate minus the smallest one
    return (census_df1.max(axis=1) - census_df1.min(axis=1)).idxmax()
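Note that the function measures change as the spread between the highest and lowest yearly estimate, which is not the same as the last year minus the first. A toy illustration with invented numbers and only three year columns:

```python
import pandas as pd

# Hypothetical yearly estimates for two made-up counties
toy = pd.DataFrame(
    {
        "POPESTIMATE2010": [100, 50],
        "POPESTIMATE2011": [130, 40],
        "POPESTIMATE2012": [110, 90],
    },
    index=["X County", "Y County"],
)

# axis=1 takes the max/min across the columns of each row
spread = toy.max(axis=1) - toy.min(axis=1)

# For comparison: absolute change between the first and last year only
endpoint_change = (toy["POPESTIMATE2012"] - toy["POPESTIMATE2010"]).abs()

print(spread.idxmax(), endpoint_change.idxmax())
```

For X County the spread is 30 while the endpoint change is only 10, because the peak falls in the middle year.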

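Task 4: Return the counties that belong to region 1 or region 2, whose name starts with 'Washington', and whose 2015 population estimate is larger than their 2014 estimate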

def answer_eight():
    return census_df[(census_df.REGION < 3)
                     & (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014)
                     & (census_df.CTYNAME.str[:10] == 'Washington')][['STNAME', 'CTYNAME']]
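The same filter can be written with `isin` and `str.startswith`, which avoids counting the ten characters of 'Washington' by hand. This is an equivalent sketch on invented rows, not a change to the answer above.

```python
import pandas as pd

# Hypothetical rows using the census column names
toy = pd.DataFrame({
    "STNAME": ["Iowa", "Ohio", "Utah"],
    "CTYNAME": ["Washington County", "Wayne County", "Washington County"],
    "REGION": [2, 2, 4],
    "POPESTIMATE2014": [100, 100, 100],
    "POPESTIMATE2015": [110, 120, 130],
})

mask = (
    toy["REGION"].isin([1, 2])
    & toy["CTYNAME"].str.startswith("Washington")
    & (toy["POPESTIMATE2015"] > toy["POPESTIMATE2014"])
)

# .loc with a boolean mask and a column list filters rows and columns in one step
result = toy.loc[mask, ["STNAME", "CTYNAME"]]
print(result)
```

The original row index is preserved in the result, so rows can still be traced back to the source table.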
