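SQL Optimization Tips in Hive

The core idea behind optimizing SQL in Hive is to avoid data skew.

1. Avoid using count, distinct, and group by together in the same query.

2. In a left join, put the table with the smaller data volume first.

3. Use subqueries wherever possible.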
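
As a rough illustration of tips 1 and 3, a count(distinct ...) combined with group by can be split into a deduplicating subquery followed by a plain count. The table events and its columns userid and day are hypothetical, chosen only for this sketch.

```sql
-- Hypothetical table for illustration only: events(userid string, day string)

-- Before: count(distinct) together with group by. Rows are distributed by month only,
-- so the deduplication for a whole month tends to land on a single reducer.
select substr(day,1,6) as mon, count(distinct userid) as uv
from events
group by substr(day,1,6);

-- After: the subquery deduplicates (month, userid) pairs, so that work is spread
-- over reducers by both keys; the outer query only counts the remaining rows.
select mon, count(*) as uv
from (
  select substr(day,1,6) as mon, userid
  from events
  where userid is not null   -- count(distinct) ignores NULLs, so drop them here too
  group by substr(day,1,6), userid
) t
group by mon;
```

Both queries should return the same counts; the second trades one extra job for a more even distribution of rows across reducers.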

Parameter configuration

set mapred.reduce.tasks=50;
set mapreduce.reduce.memory.mb=6000;
set mapreduce.reduce.shuffle.memory.limit.percent=0.06;

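Where data skew is involved, the problem is mainly skew in the reduce phase. It can often be mitigated by tuning three settings in Hive: the number of parallel reduce tasks, the memory size of each reducer (in MB), and the shuffle memory limit, which controls when shuffled data on the reduce side is written to disk.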
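
The three settings above, annotated line by line. The values are the ones used in this post and are not general recommendations. The last two statements are an assumption added for illustration: they are Hive's own skew switches, which the original post does not use.

```sql
-- Number of reduce tasks per job
-- (newer Hadoop releases name this property mapreduce.job.reduces).
set mapred.reduce.tasks=50;

-- Memory requested for each reduce task, in MB.
set mapreduce.reduce.memory.mb=6000;

-- Largest share of the in-memory shuffle buffer that a single fetched map output
-- may occupy; anything larger is written to disk instead of being held in memory.
set mapreduce.reduce.shuffle.memory.limit.percent=0.06;

-- Assumption, not from the original post: Hive's built-in skew handling,
-- both off by default. Worth testing on a copy of the query before relying on them.
-- Two-stage aggregation for skewed group by keys:
set hive.groupby.skewindata=true;
-- Process heavily skewed join keys in a follow-up job:
set hive.optimize.skewjoin=true;
```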

Example 1

-- By month (original)
select substr(a.day,1,6) month, count(distinct a.userid)
from dms.tracklog_5min a
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6);
-- Optimized: filter each table in its own subquery, deduplicate, then count
select '201505', count(*) from
(select distinct c.userid
from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(select userid from
dms.tracklog_5min
where day>='201505' and day<'201506'
) tmp
on tmp.userid=c.userid
) t;
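
One caveat on the rewrite above: the original query uses an inner join, so it counts only users that appear in both tables, while a left join keeps every user from site_activeuser_tmp whether or not there is a matching row in the tracking log. If the inner-join meaning is what is wanted, one possible variant is a left semi join. This is a sketch, assuming the column names userid and month used in the optimized query.

```sql
-- Sketch only: keeps a userid from c only if it also occurs in the tracking log.
select '201505' as month, count(*) as cnt
from (
  select c.userid
  from (select userid from default.site_activeuser_tmp where month='201505') c
  left semi join
  (select userid from dms.tracklog_5min where day>='201505' and day<'201506') tmp
  on tmp.userid = c.userid
  -- group by deduplicates userid without count(distinct)
  group by c.userid
) t;
```

With left semi join, columns from the right-hand side can be referenced only in the on clause, which is enough here because only c.userid is selected.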

Example 2

-- By business unit (original)
select substr(a.day,1,6) month, count(distinct a.userid), b.dept_name
from dms.tracklog_5min a join default.d_channel b
on a.host=b.host
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6), b.dept_name;
-- Optimized
set mapred.reduce.tasks=50;
set mapreduce.reduce.memory.mb=6000;
set mapreduce.reduce.shuffle.memory.limit.percent=0.06;
select "201505" month, count(t.userid), t.dept_name
from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(select distinct a.userid userid, b.dept_name dept_name from default.d_channel b
left join
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506') a
on a.host=b.host
) t
on t.userid=c.userid
group by t.dept_name;

Example 3

-- By product (original)
select substr(a.day,1,6) month, count(distinct a.userid), b.dept_name, b.prod_name
from dms.tracklog_5min a join default.d_channel b
on a.host=b.host
join default.site_activeuser_tmp c
on a.userid=c.id
where a.day>='201505' and a.day<'201506'
group by substr(a.day,1,6), b.dept_name, b.prod_name;
-- Optimized
select "201505" month, count(t.userid) cnt, t.dept_name dept_name, t.prod_name prod_name
from
(select userid from default.site_activeuser_tmp where month='201505') c
left join
(select distinct a.userid userid, b.dept_name dept_name, b.prod_name prod_name from default.d_channel b
left join
(select host,userid from dms.tracklog_5min where day>='201505' and day<'201506') a
on a.host=b.host
) t
on t.userid=c.userid
group by t.prod_name, t.dept_name;
