Impala in Practice, Part 11: Parquet Performance Testing

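I had been considering switching Impala's file storage format to parquet for some time but never made the move. Recently I ran some more tests to see whether parquet really helps. While testing, I also measured the effect of the compute statement, so both sets of results are included here for reference. Below is a subset of the test results, taken from one small business workload.

The database and table names are, of course, not the real ones.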

Table   Row count, column count, physical storage size
ain     34231137 111.4 g
a_in    395857172 114.4 g
in      62025197 62.5 g
c       4055068 144708.3 m

This first result is a quick, ad hoc measurement taken at the time.

select count(*) from c;

File format   1st run   2nd run
text          7.72s     0.74s
parquet       5.90s     0.53s

select count(uid) from c where ***

File format   Where clauses   Duration   HDFS bytes read   Aggregate peak memory usage
text          1               826ms      3g                361.1m
parquet       1               623ms      17.1m             6.9m
text          2               1.04s      3g                112.3m
parquet       2               623ms      17.6m             7.1m
text          3               930ms      3g                112.3m
parquet       3               631ms      18.4m             7.6m
text          6               961ms      3g                120.6m
parquet       6               836ms      22.9m             45.1m
text          13              1.04s      3g                117m
parquet       13              1.04s      33.2m             19.5m
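To put the HDFS read figures in perspective, here is a small Python sketch that computes how many times less data the parquet runs read than the text runs. It uses only the numbers reported above. The assumption that "g" and "m" are binary units (1 g = 1024 m) is mine, and the text figure of 3g is already rounded in the source, so the ratios are only approximate.

```python
# HDFS bytes read by `select count(uid) from c where ...`, copied from the
# results above in their original order.
# Assumption (not stated in the source): "g" and "m" are binary units,
# so 1 g = 1024 m. The text figure "3g" is rounded in the source.
TEXT_READ_MB = 3 * 1024
PARQUET_READ_MB = [17.1, 17.6, 18.4, 22.9, 33.2]


def read_reduction(text_mb, parquet_mb):
    """Return how many times less data the parquet run read from HDFS."""
    return text_mb / parquet_mb


# One ratio per parquet run, rounded to the nearest whole number.
ratios = [round(read_reduction(TEXT_READ_MB, p)) for p in PARQUET_READ_MB]

for parquet_mb, ratio in zip(PARQUET_READ_MB, ratios):
    print(f"parquet read {parquet_mb} m, roughly {ratio}x less than text")
```

The text runs read about 3g every time, while the parquet reads grow slowly as more where conditions are added. That is consistent with parquet being a columnar format: a query only has to read the columns it references, whereas a text table has to be scanned row by row.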

The dev table used in the next query is a separate table and is not stored in parquet format.

select substr(a1.dt,1,7) dt, count(distinct a1.uid)
from (
    select userid uid, createtime dt
    from dev) a1
left join (
    select uid, dt
    from (
        select userid uid, time dt from a_in
        union all
        select uid uid, stime dt from ain
        where atype='1'
        union all
        select uid, time dt
        from c
        where state!=0
        and source='test') a1 ) a2
on a1.uid = a2.uid and substr(a1.dt,1,7)>substr(a2.dt,1,7)
left join (
    select uid, dt
    from (
        select userid uid, time dt from in
        union all
        select uid, time dt from c
        where state!=0
        and source='pc') a1 ) a3
on a1.uid = a3.uid and substr(a1.dt,1,7)>substr(a3.dt,1,7)
where a2.uid is null
and a3.uid is not null
group by dt
order by dt;

File format   Duration      HDFS bytes read   Aggregate peak memory usage
text          12 min 38 s   71.4g             27.5g
parquet       12 min 27 s   22.5g             27.6g

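The next query is somewhat more complex: it uses three of the tables above and involves a join. Because I had recently discovered how remarkably effective the compute statement is, I included it in this test as well.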

select substr(a1.dt,1,7) dt, count(distinct a1.uid)
from (
    select uid, createtime dt
    from c
    where state!=0) a1
inner join (
    select uid, dt
    from (
        select userid uid, logtime dt from a_in
        union all
        select uid uid, stime dt from ain
        where atype='1') a1 ) a2
on a1.uid = a2.uid and substr(a1.dt,1,7) = substr(a2.dt,1,7)
group by dt
order by dt;

File format   Compute run beforehand   Duration     HDFS bytes read   Aggregate peak memory usage
text          n                        5 min 16 s   46.7g             12.1g
parquet       n                        3 min 48 s   1.7g              27.3g
text          y                        34.9 s       46.7g             1.5g
parquet       y                        14.5 s       1.7g              1.1g
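The durations above are easier to compare as ratios. The following Python sketch converts them to seconds and computes two things: the speedup from running compute beforehand for each file format, and the speedup from parquet over text with and without compute. All numbers come from the results above; the variable and function names are only illustrative.

```python
# Durations from the results above, converted to seconds.
# Keys are (file format, whether compute was run beforehand).
DURATION_S = {
    ("text", False): 5 * 60 + 16,     # 5 min 16 s
    ("parquet", False): 3 * 60 + 48,  # 3 min 48 s
    ("text", True): 34.9,
    ("parquet", True): 14.5,
}


def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


# Effect of running compute beforehand, per file format.
compute_gain = {
    fmt: round(speedup(DURATION_S[(fmt, False)], DURATION_S[(fmt, True)]), 1)
    for fmt in ("text", "parquet")
}

# Effect of parquet over text, without and with compute.
parquet_gain = {
    computed: round(
        speedup(DURATION_S[("text", computed)], DURATION_S[("parquet", computed)]), 1
    )
    for computed in (False, True)
}

print("speedup from compute:", compute_gain)
print("speedup from parquet:", parquet_gain)
```

In this single test, running compute beforehand shortens the query far more than the change of file format does, and the two effects combine: parquet with compute is the fastest of the four runs and also has the lowest peak memory usage.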

2016-04-27 14:55:00 hzct
