12 Uses of the Pandas index

2021-10-16 17:15:45

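Data stored in an ordinary column can also be used for lookups, so what is the advantage of using an index?

Summary of what the index is for:

More convenient data lookups;

Better lookup performance;

Automatic data alignment;

Support for more, and more powerful, data structures;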

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

df.count()

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

# drop=False keeps the index column among the regular columns as well
df.set_index("userId", inplace=True, drop=False)
df.head()

        userId  movieId  rating  timestamp
userId
1            1        1     4.0  964982703
1            1        3     4.0  964981247
1            1        6     4.0  964982224
1            1       47     5.0  964983815
1            1       50     5.0  964982931

df.index

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)
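To make the effect of the drop argument concrete, here is a minimal sketch on a small made-up frame. The frame name toy and its values are illustrative assumptions, not data from ratings.csv.

```python
import pandas as pd

# Small illustrative frame; the values are made up, not taken from ratings.csv.
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})

# Default (drop=True): userId moves into the index and leaves the columns.
dropped = toy.set_index("userId")

# drop=False: userId becomes the index and also stays as a regular column.
kept = toy.set_index("userId", drop=False)

print(list(dropped.columns))  # ['movieId', 'rating']
print(list(kept.columns))     # ['userId', 'movieId', 'rating']
```

Keeping the column is what allows the same frame to be queried both through the index and through a condition on the column.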

# Query through the index
df.loc[500].head(5)

        userId  movieId  rating   timestamp
userId
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065

# Query through a condition on the column
df.loc[df["userId"] == 500].head()

        userId  movieId  rating   timestamp
userId
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065
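The two query styles return the same rows. The sketch below checks this on a small made-up frame (toy and its values are assumptions for illustration only).

```python
import pandas as pd

# Illustrative data; userId is kept both as the index and as a column.
toy = pd.DataFrame({
    "userId": [500, 500, 7, 500],
    "movieId": [1, 11, 39, 101],
    "rating": [4.0, 1.0, 1.0, 1.0],
}).set_index("userId", drop=False)

by_index = toy.loc[500]                     # label lookup on the index
by_column = toy.loc[toy["userId"] == 500]   # boolean mask on the column

same = by_index.equals(by_column)
print(same)
```

One difference to keep in mind: when exactly one row carries the label, `toy.loc[label]` returns a Series rather than a DataFrame, whereas the boolean mask always returns a DataFrame.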

# Shuffle the rows randomly
from sklearn.utils import shuffle

df_shuffle = shuffle(df)
df_shuffle.head()

        userId  movieId  rating   timestamp
userId
160        160     2340     1.0   985383314
129        129     1136     3.5  1167375403
167        167    44191     4.5  1154718915
536        536      276     3.0   832839990
67          67     5952     2.0  1501274082

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

False

df_shuffle.index.is_unique

False
# Time the lookup of the rows with id == 500
%timeit df_shuffle.loc[500]

376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df_sorted = df_shuffle.sort_index()
df_sorted.head()

        userId  movieId  rating  timestamp
userId
1            1     2985     4.0  964983034
1            1     2617     2.0  964982588
1            1     3639     4.0  964982271
1            1        6     4.0  964982224
1            1      733     4.0  964982400

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

True

df_sorted.index.is_unique

False

%timeit df_sorted.loc[500]

203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
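The comparison can be reproduced outside a notebook with the standard timeit module. The following is a sketch under stated assumptions: the data is synthetic (random ids in the range 1 to 610), and `DataFrame.sample(frac=1)` stands in for sklearn's shuffle, so absolute timings will differ from the figures above. A commonly given explanation for the gap is that pandas can binary-search a sorted index, while an unsorted, non-unique index has to be scanned; treat that as background rather than something this sketch proves.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings data (an assumption, not the real file).
rng = np.random.default_rng(0)
n = 100_000
frame = pd.DataFrame({
    "userId": rng.integers(1, 611, size=n),
    "rating": rng.random(n),
}).set_index("userId", drop=False)

# sample(frac=1) returns all rows in random order, replacing sklearn's shuffle.
shuffled = frame.sample(frac=1, random_state=0)
sorted_frame = shuffled.sort_index()

print(shuffled.index.is_monotonic_increasing)      # False
print(sorted_frame.index.is_monotonic_increasing)  # True

# Time the same label lookup on both frames.
t_shuffled = timeit.timeit(lambda: shuffled.loc[500], number=200)
t_sorted = timeit.timeit(lambda: sorted_frame.loc[500], number=200)
print(f"shuffled: {t_shuffled:.4f}s  sorted: {t_sorted:.4f}s")
```

Both lookups return the same set of rows; only the order of the rows and the time taken differ.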
Automatic data alignment works for both Series and DataFrame.

s1 = pd.Series([1, 2, 3], index=list("abc"))
s1

a    1
b    2
c    3
dtype: int64

s2 = pd.Series([2, 3, 4], index=list("bcd"))
s2

b    2
c    3
d    4
dtype: int64

s1 + s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
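The NaN values appear because labels a and d exist on only one side. As a hedged sketch of two related behaviours: `Series.add` accepts a `fill_value` to treat a missing label as a given number, and DataFrames align in the same way on their row labels. The names d1, d2 and their values are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing on one side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)

# DataFrames align on row labels (and column labels) in the same way.
d1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
d2 = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])
total = d1 + d2
print(total)
```

Only label b is present in both frames, so it is the only row of total with a numeric result.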

Many powerful index data structures are available.
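The text does not name specific index types, so the two shown here, MultiIndex and DatetimeIndex, are chosen purely for illustration, and all data in the sketch is made up.

```python
import pandas as pd

# MultiIndex: hierarchical labels, here (userId, movieId) pairs.
ratings = pd.DataFrame(
    {"rating": [4.0, 3.5, 5.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 20), (2, 10), (2, 30)],
        names=["userId", "movieId"],
    ),
)
user1 = ratings.loc[1]                 # all rows of userId 1, indexed by movieId
one = ratings.loc[(2, 30), "rating"]   # a single value addressed by both levels

# DatetimeIndex: labels are timestamps, so a month can be selected by string.
daily = pd.Series(
    range(60),
    index=pd.date_range("2021-01-01", periods=60, freq="D"),
)
feb = daily.loc["2021-02"]             # every day in February 2021

print(user1)
print(one)
print(len(feb))
```

In both cases the selection is expressed through labels alone, without writing an explicit filter condition on a column.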
