12 Uses of the Pandas Index

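Data stored in an ordinary column can also be used for lookups, so what is gained by using an index?

Summary of what the index is for:

More convenient data lookups;

Better lookup performance;

Automatic data alignment;

Support for more, and more powerful, data structures.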

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

df.head()

   userid  movieid  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

df.count()

userid       100836
movieid      100836
rating       100836
timestamp    100836
dtype: int64

# drop=False keeps the index column available as a regular column as well
df.set_index("userid", inplace=True, drop=False)
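The effect of the drop parameter is easier to see on a small frame. The following is a minimal sketch on hypothetical toy data (the three rows are made up for illustration and are not taken from ratings.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy = pd.DataFrame({
    "userid": [1, 1, 2],
    "movieid": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})

kept = toy.set_index("userid", drop=False)  # userid is both the index and a column
dropped = toy.set_index("userid")           # default drop=True: userid lives only in the index

print(list(kept.columns))     # userid is still listed among the columns
print(list(dropped.columns))  # userid no longer appears as a column
```

Keeping the column is what lets the later examples query the same frame both by index label and by a condition on the userid column.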

df.head()

        userid  movieid  rating  timestamp
userid
1            1        1     4.0  964982703
1            1        3     4.0  964981247
1            1        6     4.0  964982224
1            1       47     5.0  964983815
1            1       50     5.0  964982931

df.index

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userid', length=100836)

# Lookup using the index
df.loc[500].head(5)

        userid  movieid  rating   timestamp
userid
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065

# Lookup using a condition on the column
df.loc[df["userid"] == 500].head()

        userid  movieid  rating   timestamp
userid
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065
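Both query styles select the same rows; the index version is simply shorter to write. A minimal sketch that checks this on hypothetical toy data (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; userid is kept as both index and column
toy = pd.DataFrame({
    "userid": [500, 7, 500, 7],
    "movieid": [1, 2, 11, 3],
    "rating": [4.0, 3.0, 1.0, 5.0],
}).set_index("userid", drop=False)

by_index = toy.loc[500]                    # label lookup on the index
by_column = toy.loc[toy["userid"] == 500]  # boolean mask on the column

print(by_index.equals(by_column))
```

One difference to keep in mind: when a label matches exactly one row, toy.loc[label] returns a Series rather than a DataFrame. Passing a list, as in toy.loc[[500]], always returns a DataFrame.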

# Shuffle the rows randomly
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

df_shuffle.head()

        userid  movieid  rating   timestamp
userid
160        160     2340     1.0   985383314
129        129     1136     3.5  1167375403
167        167    44191     4.5  1154718915
536        536      276     3.0   832839990
67          67     5952     2.0  1501274082

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

False

df_shuffle.index.is_unique

False

# Time the lookup of the rows with id == 500
%timeit df_shuffle.loc[500]

376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df_sorted = df_shuffle.sort_index()

df_sorted.head()

        userid  movieid  rating  timestamp
userid
1            1     2985     4.0  964983034
1            1     2617     2.0  964982588
1            1     3639     4.0  964982271
1            1        6     4.0  964982224
1            1      733     4.0  964982400

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

True

df_sorted.index.is_unique

False

%timeit df_sorted.loc[500]

203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
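The timings above depend on the machine and on the pandas version. As a rough way to reproduce the comparison outside a notebook, here is a minimal sketch using the standard timeit module on synthetic data. The data, the frame names, and the extra unique-index case are assumptions added for illustration; the usual explanation is that pandas can use a hash table for a unique index and binary search for a sorted one, and otherwise has to scan, but the exact numbers will vary.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical synthetic data: user ids 1..610 drawn at random, so the
# index is neither unique nor sorted (similar to df_shuffle).
ids = rng.integers(1, 611, size=n)
shuffled = pd.DataFrame({"rating": rng.random(n)},
                        index=pd.Index(ids, name="userid"))
sorted_df = shuffled.sort_index()  # monotonic, still not unique
unique_df = pd.DataFrame({"rating": rng.random(n)},
                         index=pd.Index(rng.permutation(n), name="rowid"))

# Average time of one label lookup per frame
timings = {}
for name, frame in [("shuffled", shuffled), ("sorted", sorted_df), ("unique", unique_df)]:
    timings[name] = timeit.timeit(lambda: frame.loc[500], number=200) / 200

for name, seconds in timings.items():
    print(f"{name:>8}: {seconds * 1e6:.1f} µs per lookup")
```

Checking index.is_unique and index.is_monotonic_increasing before a lookup-heavy workload is a cheap way to see which case a given frame falls into.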

Automatic index alignment applies to both Series and DataFrame objects.

s1 = pd.Series([1, 2, 3], index=list("abc"))

s1

a    1
b    2
c    3
dtype: int64

s2 = pd.Series([2, 3, 4], index=list("bcd"))

s2

b    2
c    3
d    4
dtype: int64

s1 + s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
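The Series example carries over to DataFrame objects in the same way. A minimal sketch, using two hypothetical single-column frames that mirror s1 and s2 (the column name x is made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=list("abc"))
df2 = pd.DataFrame({"x": [2, 3, 4]}, index=list("bcd"))

# Rows are matched by index label; labels present on only one side give NaN
total = df1 + df2

# add() with fill_value treats the missing side as 0 instead of producing NaN
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```

Whether NaN or a fill value is the right choice depends on whether a missing label means "unknown" or "zero" in the data at hand.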

Many powerful index data structures
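The text does not list which index types it has in mind. As an illustration only, the sketch below picks three that pandas provides (MultiIndex, DatetimeIndex, and CategoricalIndex); the choice of types and all sample values are assumptions.

```python
import pandas as pd

# MultiIndex: hierarchical labels, selectable by the outer level
mi = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one")],
    names=["first", "second"],
)
s = pd.Series([1, 2, 3], index=mi)
bar_rows = s.loc["bar"]

# DatetimeIndex: supports selecting by a partial date string
ts = pd.Series([10, 20, 30],
               index=pd.date_range("2024-01-30", periods=3, freq="D"))
february = ts.loc["2024-02"]

# CategoricalIndex: labels drawn from a fixed set of categories
ci = pd.CategoricalIndex(["low", "high", "low"], categories=["low", "high"])
sc = pd.Series([1, 2, 3], index=ci)
low_rows = sc.loc["low"]

print(bar_rows, february, low_rows, sep="\n\n")
```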
