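Data stored in an ordinary column can be queried too, so what does an index add?
What an index is good for, in summary:
Easier data queries;
Better query performance;
Automatic data alignment;
Support for more, and more powerful, data structures.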
import pandas as pd
df = pd.read_csv("./datas/ml-latest-small/ratings.csv")
df.head()
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
df.count()
userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64
# drop=False keeps the indexed column among the regular columns as well
df.set_index("userId", inplace=True, drop=False)
df.head()
        userId  movieId  rating  timestamp
userId
1            1        1     4.0  964982703
1            1        3     4.0  964981247
1            1        6     4.0  964982224
1            1       47     5.0  964983815
1            1       50     5.0  964982931
df.index
Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)
# Query through the index
df.loc[500].head(5)
        userId  movieId  rating   timestamp
userId
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065
# Query through a condition on the column
df.loc[df["userId"] == 500].head()
        userId  movieId  rating   timestamp
userId
500        500        1     4.0  1005527755
500        500       11     1.0  1005528017
500        500       39     1.0  1005527926
500        500      101     1.0  1005527980
500        500      104     4.0  1005528065
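The two queries above select the same rows; the index form is simply shorter. A minimal sketch that checks this on a small made-up frame (the `toy` data and its values are hypothetical stand-ins for ratings.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv (values are made up).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})
toy.set_index("userId", inplace=True, drop=False)

by_index = toy.loc[2]                       # label lookup on the index
by_condition = toy.loc[toy["userId"] == 2]  # boolean mask on the column

# Both selections contain the same two rows for user 2.
print(by_index.equals(by_condition))
```

One difference to keep in mind: when a label occurs only once in the index (user 3 here), `toy.loc[3]` returns a Series for that single row, while the boolean-mask form always returns a DataFrame.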
# Shuffle the rows randomly
from sklearn.utils import shuffle
df_shuffle = shuffle(df)
df_shuffle.head()
        userId  movieId  rating   timestamp
userId
160        160     2340     1.0   985383314
129        129     1136     3.5  1167375403
167        167    44191     4.5  1154718915
536        536      276     3.0   832839990
67          67     5952     2.0  1501274082
# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing
False
df_shuffle.index.is_unique
False
# Time the lookup of id == 500 on the shuffled data
%timeit df_shuffle.loc[500]
376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df_sorted = df_shuffle.sort_index()
df_sorted.head()
        userId  movieId  rating  timestamp
userId
1            1     2985     4.0  964983034
1            1     2617     2.0  964982588
1            1     3639     4.0  964982271
1            1        6     4.0  964982224
1            1      733     4.0  964982400
# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing
True
df_sorted.index.is_unique
False
%timeit df_sorted.loc[500]
203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
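The comparison above can be repeated outside IPython with the standard `timeit` module. The sketch below is only an approximation of the experiment: it uses synthetic data instead of ratings.csv, and `DataFrame.sample(frac=1)` instead of `sklearn.utils.shuffle` so that scikit-learn is not required. The names `data`, `shuffled`, and `ordered` are made up for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the ratings data (sizes and values are arbitrary).
rng = np.random.default_rng(0)
n = 100_000
data = pd.DataFrame({
    "userId": rng.integers(1, 611, size=n),
    "rating": rng.integers(1, 11, size=n) / 2,
})
data.set_index("userId", inplace=True, drop=False)

shuffled = data.sample(frac=1, random_state=0)  # pandas-only row shuffle
ordered = shuffled.sort_index()                 # same rows, sorted index

# Time 100 label lookups on each frame and report the per-lookup average.
t_shuffled = timeit.timeit(lambda: shuffled.loc[500], number=100)
t_ordered = timeit.timeit(lambda: ordered.loc[500], number=100)
print(f"shuffled: {t_shuffled / 100 * 1e6:.0f} µs per lookup")
print(f"sorted:   {t_ordered / 100 * 1e6:.0f} µs per lookup")
```

The likely reason for the gap is that a sorted index lets pandas locate a label by binary search, whereas an unsorted, non-unique index generally has to be scanned in full.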
Automatic alignment on the index works for both Series and DataFrame.
s1 = pd.Series([1, 2, 3], index=list("abc"))
s1
a    1
b    2
c    3
dtype: int64
s2 = pd.Series([2, 3, 4], index=list("bcd"))
s2
b    2
c    3
d    4
dtype: int64
s1 + s2
a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
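The same alignment happens with DataFrames. A small sketch, assuming two made-up frames `df1` and `df2` that share only some row labels; it also shows `DataFrame.add` with `fill_value`, which treats a label missing on one side as 0 instead of producing NaN:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=list("abc"))
df2 = pd.DataFrame({"x": [2, 3, 4]}, index=list("bcd"))

total = df1 + df2                    # labels missing on either side give NaN
filled = df1.add(df2, fill_value=0)  # a missing label counts as 0 instead

print(total)
print(filled)
```

Rows "b" and "c" exist in both frames and are summed; rows "a" and "d" exist in only one frame, so they are NaN in `total` and keep their single value in `filled`.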
pandas offers many powerful index data structures.
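As one illustration of such a structure, `MultiIndex` lets rows be keyed by more than one level. The sketch below is not taken from the ratings example; the `ratings` Series and its values are made up to show the idea:

```python
import pandas as pd

# Hypothetical ratings keyed by (userId, movieId); values are made up.
idx = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 20), (2, 10), (2, 30)],
    names=["userId", "movieId"],
)
ratings = pd.Series([4.0, 3.5, 5.0, 2.0], index=idx, name="rating")

print(ratings.loc[1])        # all ratings by user 1, indexed by movieId
print(ratings.loc[(2, 30)])  # a single rating, selected by both levels
```

Other index types in pandas, such as `DatetimeIndex` and `CategoricalIndex`, follow the same pattern of adding type-specific lookups on top of label-based selection.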