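Hive if statement: Hive in Practice (Advanced)

1.1 How Hive SQL executes

Put simply, Hive is a query engine: through syntax analysis, parsing, optimization, and related steps, it converts SQL into MapReduce jobs. An MR job generally passes through the map, shuffle/sort, and reduce stages.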
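One way to see this translation is to ask Hive for the plan it compiles. The sketch below uses Hive's `explain` statement on a small aggregation; the table and columns (`tmp.dual`, `ptdate`, `pay_type`) are borrowed from the examples in this article, and the aggregation itself is only an illustration.

```sql
-- EXPLAIN prints the compiled plan without running the query.
-- The output lists the stage dependencies and, for each stage,
-- the operator tree that runs on the map side and the reduce side.
explain
select pay_type, count(1) as cnt
from tmp.dual
where ptdate = '2020-09-21'
group by pay_type;
```

Reading the stage list before and after a rewrite is a quick way to check whether a change actually reduced the number of jobs.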

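1.2 Summary of Hive SQL optimization

Common SQL optimization techniques fall into several areas:

1.2.1 Business-scenario optimization

1.2.2 Optimizing the statement itself

• Filter first, then join, to reduce the amount of data taking part in the join

• Replace join with union all so the branches can run in parallel

• When joining multiple tables, use the same join key wherever possible

• Filter out nulls and meaningless values, or scatter them with random numbers, to avoid skew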
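The first and last points can be sketched together as follows. The table and column names (`tmp.orders`, `tmp.users`, `dt`, `user_id`, `amt`, `city`) are hypothetical and stand in for whatever tables are being joined.

```sql
-- Filter each side in a subquery before joining, so that only the
-- needed partition, columns, and non-null keys take part in the join.
select o.user_id, o.amt, u.city
from (
    select user_id, amt
    from tmp.orders              -- hypothetical table
    where dt = '2020-09-21'      -- prune to one partition first
      and user_id is not null    -- drop null keys that would pile onto one reducer
) o
join (
    select user_id, city
    from tmp.users               -- hypothetical table
    where dt = '2020-09-21'
) u
  on o.user_id = u.user_id;
```

Dropping the null keys is only appropriate when those rows are not needed in the result; when they must be kept, scattering them with a random value is the alternative.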

1.2.3 Parameter configuration optimization

1.3 Hive SQL optimization examples

1.3.1 Parameter-setting optimization
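The article does not list specific parameters at this point. As an illustrative starting point only, these are some commonly tuned Hive settings. The property names are real Hive configuration keys, but the selection and the values shown are assumptions and should be checked against the cluster and Hive version in use.

```sql
-- Run independent stages of one query in parallel
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- Partial aggregation on the map side for group by
set hive.map.aggr=true;

-- Convert joins against small tables into map-side joins
set hive.auto.convert.join=true;

-- Input bytes handled per reducer (controls how many reducers are launched)
set hive.exec.reducers.bytes.per.reducer=256000000;

-- Runtime handling of skewed join keys
set hive.optimize.skewjoin=true;
```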

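1.3.2 Logic optimization

1) Fetch only the data you need: specify the columns and the partitions to read

-- Fetch only what is needed: name the columns and the partition
select
    ptdate,
    pay_type                   -- select only the required fields, nothing else
from tmp.dual
where ptdate = '2020-09-21'    -- partition to query, usually a date
  and type = '***'             -- filter down to the needed rows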

2) Optimizing cube and distinct

-- Before optimization: 2-3 hours
-- After optimization: about 45 min
-- "合計" ("total") labels the dimensions rolled up by grouping sets
select id
    , nvl(type, "合計") type
    , nvl(plat, "合計") plat
    , nvl(is_new, "合計") is_new
    , nvl(flag, "合計") flag
    , nvl(name, "合計") name
    , scene
    , count(user_id) exposure_product_uv
    , sum(exposure_product_pv) exposure_product_pv
    , sum(unique_product_exposure_pv) unique_product_exposure_pv
from (
    select id
        , nvl(type, "合計") type
        , nvl(plat, "合計") plat
        , nvl(is_new, "合計") is_new
        , nvl(flag, "合計") flag
        , nvl(name, "合計") name
        , scene
        , user_id
        , sum(pv) exposure_product_pv
        , count(1) unique_product_exposure_pv
    from (
        select id
            , nvl(type, "合計") type
            , nvl(plat, "合計") plat
            , nvl(is_new, "合計") is_new
            , nvl(flag, "合計") flag
            , nvl(name, "合計") name
            , scene
            , user_id
            , request_id
            , count(1) pv
        from tmp_exp_log
        where event = 'exposure_product'
        group by id, flag, name, is_new, plat, scene, type, user_id, request_id
        grouping sets (
            (id, plat, is_new, flag, name, scene, user_id, request_id)
            , (id, plat, is_new, scene, user_id, request_id)
            , (id, plat, scene, user_id, request_id)
            , (id, is_new, scene, user_id, request_id)
            , (id, name, scene, user_id, request_id)
            , (id, flag, scene, user_id, request_id)
            , (id, scene, user_id, request_id)
            , (id, plat, is_new, flag, name, scene, type, user_id, request_id)
            , (id, plat, is_new, scene, type, user_id, request_id)
            , (id, plat, scene, type, user_id, request_id)
            , (id, is_new, scene, type, user_id, request_id)
            , (id, name, scene, type, user_id, request_id)
            , (id, flag, scene, type, user_id, request_id)
            , (id, scene, type, user_id, request_id)
        )
    ) b
    group by id, flag, name, is_new, plat, scene, type, user_id
) a
group by id, flag, name, is_new, plat, scene, type;
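The query above gets its unique-user count by first grouping down to `user_id` and then counting rows, rather than calling `count(distinct user_id)` over the raw log. A minimal sketch of that pattern in isolation, assuming a hypothetical table `tmp.events` with columns `id` and `user_id`:

```sql
-- Direct form: deduplication for each id happens inside the
-- reducer that receives that id, which is slow for large keys.
select id, count(distinct user_id) as uv
from tmp.events                  -- hypothetical table
group by id;

-- Two-stage form: deduplicate with group by, then count.
-- count(user_id) skips the null group, matching count(distinct).
select id, count(user_id) as uv
from (
    select id, user_id
    from tmp.events
    group by id, user_id         -- one row per (id, user_id)
) t
group by id;
```

The two-stage form costs an extra job, so it tends to pay off only when the distinct column has many values per group.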

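3) Add a random number to the join field so that special values do not produce a Cartesian-product-style blowup

-- Before optimization: did not finish in over an hour
-- After optimization: results in 20 min
-- '-' is a placeholder value; swapping it for a random number keeps
-- those rows from all landing on, and matching, a single join key
select
    l.*,
    ......
from tmp.dual1 l
left join tmp.dual2 u
    on (l.id = u.id and u.date = '2020-11-01')
left join tmp.dual3 p
    on (if(l.pcode = '-', rand(10), l.pcode) = p.pcode)
left join tmp.dual4 w
    on (if(l.wcode = '-', rand(10), l.wcode) = w.wcode)
left join tmp.dual5 el
    on (if(l.eid = '-', rand(10), l.eid) = el.eid)
where l.date = '2020-11-01'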

4) When the joined subqueries share the same dimensions, switch to union all + group by to cut down on joins and run the branches in parallel

select
    date,
    sum(amt) as amt,
    sum(uv) as uv
from (
    select
        f.date,
        sum(f.amt) as amt,
        0 as uv
    from tmp.dual1 f
    where 1 = 1
      and f.date = '2020-10-11'
    group by f.date
    union all
    select
        f.date,
        0 as amt,
        count(distinct f.user_id) as uv
    from tmp.dual2 f
    where 1 = 1
      and f.date = '2020-10-11'
    group by f.date
) uniontable
group by date

Summary: optimizing Hive SQL comes down to avoiding data skew, redundant data, and excessive jobs or I/O, and making efficient use of the cluster's parallelism.
