Sorting massive data with Python

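We had a dataset of roughly 100 GB that needed to be sorted by a key. The first idea was to add an ORDER BY to the SELECT when pulling the data out of the database, but that ran out of memory, so we decided to export the data and sort it with Python instead.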

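The approach is split, sort, then merge: cut the 100 GB file into 40 data files of 2.5 GB each, sort each file separately, and then merge the sorted files. The idea is exactly the same as the LeetCode problem of merging n sorted arrays.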
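The merge idea can be sketched on a few small in-memory lists before applying it to files. This is only an illustration: the sample rows and the helper name `first_column` are made up for the example, and the key is placed in column 0 rather than column 120 to keep the rows short.

```python
import heapq

# Three inputs, each already sorted by the integer in its first column.
a = ['1,x\n', '4,x\n', '9,x\n']
b = ['2,y\n', '3,y\n', '10,y\n']
c = ['5,z\n']


def first_column(line):
    # Compare keys as integers; as strings, '10' would sort before '9'.
    return int(line.split(',')[0])


# heapq.merge yields items lazily, one at a time, so it never needs
# to hold all inputs in memory at once.
merged = heapq.merge(a, b, c, key=first_column)
keys = [first_column(line) for line in merged]
print(keys)  # [1, 2, 3, 4, 5, 9, 10]
```

Because the merge is lazy, the same call works when the inputs are open file objects instead of lists.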

import glob
import heapq
import sys


def sort_key(line):
    # ad_id is the column at index 120; compare it as an integer
    return int(line.split(',')[120])


def main():
    csv_list = glob.glob('./csv/*.csv')
    print('find %s csv files' % len(csv_list))
    # if there are fewer than 2 csv files, there is nothing to merge: exit
    if len(csv_list) < 2:
        return 0

    # open every csv file and keep the file handles
    print('processing............')
    file_handler = []
    for i in csv_list:
        print('opening ' + str(i))
        fr = open(i, 'rt')
        file_handler.append(fr)

    # merge all (already sorted) files, ordered by ad_id whose index is 120
    res = heapq.merge(*file_handler, key=sort_key)

    # cnt: number of records written to the current output file
    # file_ptr: handle of the output file currently being written
    # cntfile: sequence number of the current output file
    cnt = 0
    file_ptr = None
    cntfile = 0
    for line in res:
        if cnt == 0:
            print('creating file ' + str(cntfile))
            file_ptr = open('./csv_sort/file_' + str(cntfile) + '.csv', 'w')
        file_ptr.write(line)
        cnt += 1
        if cnt % 20000 == 0:
            print('already writing : ' + str(cnt))
        if cnt == 540000:
            print('file ' + str(cntfile) + ' done')
            cnt = 0
            file_ptr.close()
            cntfile += 1

    # close the last, partially filled file
    if cnt != 0:
        print('file ' + str(cntfile) + ' done')
        file_ptr.close()

    # close all input files
    for fr in file_handler:
        fr.close()
    print('done')
    return 0


if __name__ == '__main__':
    sys.exit(main())
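The script above only covers the merge phase and assumes that every file under ./csv/ is already sorted by the key column. The post does not show how the chunks were produced, so the following is only a sketch of one possible split-and-sort step. The function name `split_and_sort`, its parameters, and the choice to cut chunks by line count rather than by byte size are all assumptions made for illustration.

```python
import os


def split_and_sort(src_path, out_dir, lines_per_chunk, key_index):
    """Split src_path into sorted chunk files and return their paths.

    Hypothetical helper: each chunk holds at most lines_per_chunk lines
    and is sorted in memory by the integer in column key_index.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []

    def flush(lines):
        # Sort one chunk in memory, then write it to its own file.
        lines.sort(key=lambda x: int(x.split(',')[key_index]))
        path = os.path.join(out_dir, 'chunk_%d.csv' % len(chunk_paths))
        with open(path, 'w') as out:
            out.writelines(lines)
        chunk_paths.append(path)

    buf = []
    with open(src_path, 'rt') as src:
        for line in src:
            # Ensure every record ends with a newline, so that records
            # from different chunks are not glued together after merging.
            if not line.endswith('\n'):
                line += '\n'
            buf.append(line)
            if len(buf) >= lines_per_chunk:
                flush(buf)
                buf = []
    if buf:
        flush(buf)
    return chunk_paths
```

A call such as `split_and_sort('export.csv', './csv', 1000000, 120)` would fill ./csv/ with sorted chunks for the merge script. Here `export.csv` and the chunk size are placeholder values; the chunk size should be picked so that one chunk fits comfortably in memory.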

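Re-sorting query results with a temporary table

Consider this query first: select top 3 id title from table1 where id 5 order by id asc. The table contains rows with id 1, 2, 3, 4, 5. The intent was to get the rows with id 4, 3, 2 in that order, but the query actually returns the rows in the order 2, 3, 4. Of course, the result set can be re-sorted in the program...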

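Sorting a map with TreeMap

A TreeMap can sort a map by its keys; note that it sorts by key. Keys are usually Integer or String, but they can also be objects, provided the object's class implements the Comparable interface. class mycompare implements comparable override public stri...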

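Sorting a dictionary in Python

Sorting a dictionary? Strictly speaking this is a false proposition. Start from how a Python dictionary is defined: it is a hash table, so unlike a printed dictionary arranged in a, b, c, d order, its entries have no inherent sorted order among themselves. In practice, however, we do need this kind of sorting: output ordered by the values, or by some other unusual...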
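Ordering by value, as mentioned above, is usually done by sorting the dictionary's items rather than the dictionary itself. A minimal sketch follows; the dictionary `scores` and its contents are invented for the example.

```python
# Hypothetical sample data: names mapped to scores.
scores = {'carol': 72, 'alice': 90, 'bob': 85}

# sorted() returns a new list of (key, value) pairs; the dict is unchanged.
by_value = sorted(scores.items(), key=lambda kv: kv[1])
by_key = sorted(scores.items())

print(by_value)  # [('carol', 72), ('bob', 85), ('alice', 90)]
print(by_key)    # [('alice', 90), ('bob', 85), ('carol', 72)]
```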
