Processing large text data with pandas

When a data file holds millions of rows, set chunksize so that pandas reads and processes it in batches.
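As a minimal sketch of the idea (the file name, column name, and chunk size here are made up for illustration and are not from the election data set), the following writes a small CSV to a temporary directory, reads it back 1,000 rows at a time, and accumulates a total without ever holding the whole file in one DataFrame:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data: 10,000 rows with a single numeric column.
demo = pd.DataFrame({"amount": np.arange(10_000)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    demo.to_csv(path, index=False)

    total = 0
    n_chunks = 0
    # With chunksize set, read_csv yields DataFrames of at most 1,000 rows each.
    reader = pd.read_csv(path, chunksize=1_000)
    for chunk in reader:
        total += int(chunk["amount"].sum())
        n_chunks += 1
    reader.close()  # release the file handle before the directory is removed

print(n_chunks, total)
```

Only one chunk is in memory at a time, so peak memory depends on the chunk size rather than on the size of the file.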

Example: analyzing US election campaign data

Reading the data

import numpy as np

import pandas as pd

from pandas import Series, DataFrame

df1 = pd.read_csv("./usa_election.csv", low_memory=False)

df1.shape

Result: (536041, 16)  # the data set has 536041 rows

Concatenate the data with itself to build a larger text data set

df = pd.concat([df1, df1, df1, df1])

df.shape

Result: (2144164, 16)
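The row count is four times the original (4 × 536041 = 2144164) because the same frame is stacked four times. One side effect worth knowing: pd.concat keeps the original index labels unless told otherwise, so every label then appears four times. The sketch below uses a tiny hypothetical stand-in for df1 (the column names are invented) to show the difference that ignore_index=True makes:

```python
import pandas as pd

# Hypothetical stand-in for df1: three rows instead of 536041.
small = pd.DataFrame({"name": ["A", "B", "C"], "amount": [10.0, 20.0, 30.0]})

stacked = pd.concat([small, small, small, small])
renumbered = pd.concat([small, small, small, small], ignore_index=True)

print(stacked.shape)               # four times the rows, same columns
print(stacked.index.is_unique)     # False: labels 0..2 are repeated
print(renumbered.index.is_unique)  # True: rows are renumbered 0..11
```

For this article the duplicate labels do no harm, since the frame is written out with index=False anyway.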

%%time

ret = df.to_csv("./hehe.csv", index=False)

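This writes df to a file, and the %%time cell magic reports how long the write takes. Because to_csv is given a path, ret itself is None.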
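%%time only works inside IPython or Jupyter. Outside a notebook, the same measurement can be taken with time.perf_counter. This is a rough sketch on a hypothetical demo frame (random numbers, invented column names), not the article's data:

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical demo frame; the article's df has about 2.1 million rows.
frame = pd.DataFrame(np.random.rand(50_000, 4), columns=list("abcd"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "timed.csv")

    start = time.perf_counter()
    result = frame.to_csv(path, index=False)  # returns None when a path is given
    elapsed = time.perf_counter() - start

    size_bytes = os.path.getsize(path)

print(f"wrote {size_bytes} bytes in {elapsed:.3f} s; to_csv returned {result!r}")
```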

ret = pd.read_csv("./hehe.csv", low_memory=False, chunksize=500000)

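# Read the large file back in. low_memory controls whether the parser processes the file in pieces internally (False avoids mixed-type inference within a column); chunksize is the number of rows returned per batch.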

With chunksize set, ret is not a DataFrame but a TextFileReader object (shown in this run as TextFileReader at 0x122f30f0).
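A TextFileReader behaves like a one-pass iterator over DataFrames. The sketch below uses a small in-memory CSV (hypothetical data and column names) to show three ways of pulling chunks from it: next(), get_chunk(), and plain iteration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV with five data rows.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

first = next(reader)         # first chunk: two rows
second = reader.get_chunk()  # next chunk, using the same chunksize
rest = list(reader)          # whatever remains: one chunk holding the last row

print(type(reader).__name__)
print(len(first), len(second), [len(c) for c in rest])
reader.close()
```

Once the reader has been consumed, the chunks are gone. To go over the file again, call pd.read_csv again.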

Loop over the reader to get the data out

for x in ret:
    print(type(x))

Result: every x is a pandas DataFrame holding up to 500000 rows.

The data can then be processed batch by batch

# Convert the dates stored as str into a datetime type

Before processing:

After processing:

Processing steps:

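# Month abbreviation -> month number. The original mapping was lost from the post;
# this is a reconstruction that assumes dates look like "20-JUN-11".
months = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def conver(x):
    day, month, year = x.split("-")  # split the string on "-"
    datatime = "20" + year + "-" + str(months[month]) + "-" + day
    return datatime  # reassemble the pieces as year-month-day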

df1["contb_receipt_dt"] = df1["contb_receipt_dt"].map(conver)

df1["contb_receipt_dt"] = pd.to_datetime(df1["contb_receipt_dt"])  # convert to a datetime type
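The conversion above runs on the whole of df1. To apply the same idea batch by batch, convert each chunk as it comes out of the reader. The following is only a sketch under stated assumptions: the sample rows and the amount column are invented, dates are assumed to look like "20-JUN-11", and MONTHS and to_iso are illustrative names rather than anything from the original post.

```python
import io

import pandas as pd

# Assumed input format: day-MON-yy, e.g. "20-JUN-11".
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def to_iso(text):
    """Turn '20-JUN-11' into '2011-06-20' (assumes years 2000-2099)."""
    day, month, year = text.split("-")
    return f"20{year}-{MONTHS[month]:02d}-{day}"


# Hypothetical sample rows; only the date column name comes from the article.
csv_text = (
    "contb_receipt_dt,amount\n"
    "20-JUN-11,250\n"
    "01-JAN-12,50\n"
    "15-MAR-12,100\n"
)

converted = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Convert the date column of this chunk only.
    chunk["contb_receipt_dt"] = pd.to_datetime(chunk["contb_receipt_dt"].map(to_iso))
    converted.append(chunk)

result = pd.concat(converted)
print(result.dtypes)
print(result)
```

Collecting every converted chunk and concatenating them rebuilds the full table in memory, which is fine for a demo. With a truly large file, each chunk would more likely be aggregated or written out before the next one is read.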

Cumulative sums

# Cumulative sum

a = np.arange(101)  # an example array: the integers 0 to 100

display(a)

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,

13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,

26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,

39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,

52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,

65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,

78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,

91, 92, 93, 94, 95, 96, 97, 98, 99, 100])

b = a.cumsum()  # cumsum() returns the running (cumulative) sum of the array

ree = DataFrame(b, columns=["num"])

ree["num"].plot()  # plot the cumulative-sum column
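As a quick check on what cumsum() produces, the k-th running total of 0..k equals k(k+1)/2, so the last value here is 5050. The sketch below also shows one way (an illustration, not something from the original post) to get the same running total when the data arrives in batches, by carrying the last total of each batch into the next; the batch size of 25 is arbitrary.

```python
import numpy as np

a = np.arange(101)
b = a.cumsum()

# Closed form: the k-th running total of 0..k is k*(k+1)/2.
k = np.arange(101)
closed_form = k * (k + 1) // 2

# The same running total computed in batches of 25, carrying the last total forward.
pieces = []
carry = 0
for start in range(0, len(a), 25):
    part = a[start:start + 25].cumsum() + carry
    carry = part[-1]
    pieces.append(part)
batched = np.concatenate(pieces)

print(b[-1])                            # 5050
print(np.array_equal(b, closed_form))   # True
print(np.array_equal(b, batched))       # True
```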
