Hive: a general method for deduplicating hundreds of millions of rows

A while ago, on a project at work, I ran into a problem: I needed to deduplicate and then count the uids of all brands in an industry.

There are roughly 200 million uids after deduplication, and close to 300 million before.

Approach 1: the most basic approach, count(distinct uid). This job takes about 3 hours to run.

Approach 2: deduplicate with group by. The result was still poor.

Approach 3: use row_number() over(partition by uid order by uid desc) as rn and keep only the rows with rn=1. This did not work well either.

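General approach: split the job into 5 parts, i.e. run one job for each of uid%5=0,1,2,3,4, then merge the results with union all. Because a given uid always falls into the same part, the parts are disjoint and the merge cannot reintroduce duplicates. The run time dropped from 3 hours to 0.5 hours.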
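As a rough sketch of the sharding step, the script below only prints the HiveQL that each of the five jobs might run. The table name `source_table`, the assumption that uid is a numeric column, and the choice to store deduplicated uids in the temp tables are illustrative and are not taken from the original job.

```bash
#!/bin/bash
# Sketch only: prints the HiveQL for each shard instead of running it.
# Assumptions (not from the original post): uid is numeric, and
# source_table is a hypothetical stand-in for the real input.

shards=5
src_table='source_table'                        # hypothetical input table
out_prefix='wb_ad_brand_industry_count_temp'    # temp tables are numbered 1..5

# Emit the dedup statement for one shard (remainder 0..shards-1).
shard_sql() {
    local r=$1
    local n=$((r + 1))
    cat <<EOF
INSERT OVERWRITE TABLE ${out_prefix}${n}
SELECT uid
FROM ${src_table}
WHERE uid % ${shards} = ${r}
GROUP BY uid;
EOF
}

# Print all five statements; each one would go to its own job.
for ((r = 0; r < shards; r++)); do
    shard_sql "$r"
done
```

Each printed statement could then be handed to a separate job, for example with `hive -e "$(shard_sql 0)"`, so the five shards run in parallel rather than as one large deduplication.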

Launch five jobs like the one below, one for each of uid%5=0,1,2,3,4, writing to wb_ad_brand_industry_count_temp1, 2, 3, 4 and 5:

#!/bin/bash
source /usr/local/jobclient/config/.hive_config.sh
source /usr/local/jobclient/lib/source $0 $1
source /usr/local/jobclient/demo/execute_modular.sh $work_log_notice
source ./mysql_comm.sh
source ./date_comm.sh
if [ $? -ne 0 ]
then
    exit 255
fi
workpath=$(dirname $(dirname $0))
# Hive parameters
hive_db='default'
cust_mds_user_dis_info='cust_mds_user_dis_info'
wb_ad_brand_cust_industry_map='wb_ad_brand_cust_industry_map'
function make_brand_industry_count 
write_log "make_brand_industry_count start"
make_brand_industry_count

Final merge:

#!/bin/bash
source /usr/local/jobclient/config/.hive_config.sh
source /usr/local/jobclient/lib/source $0 $1
source /usr/local/jobclient/demo/execute_modular.sh $work_log_notice
source ./mysql_comm.sh
source ./date_comm.sh
if [ $? -ne 0 ]
then
    exit 255
fi
workpath=$(dirname $(dirname $0))
# Hive parameters
hive_db='default'
cust_mds_user_dis_info='cust_mds_user_dis_info'
wb_ad_brand_cust_industry_map='wb_ad_brand_cust_industry_map'
wb_ad_brand_industry_count='wb_ad_brand_industry_count' 
function make_brand_industry_count 
write_log "make_brand_industry_count start"
make_brand_industry_count
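
The merge step can be sketched in the same way. The script below only prints a query that combines the five temp tables with UNION ALL and counts the rows. It assumes each temp table holds the already deduplicated uid values of one shard, which is an assumption made for illustration; the column alias `uid_cnt` and the subquery alias `merged` are likewise hypothetical.

```bash
#!/bin/bash
# Sketch only: prints the merge query rather than executing it.
# Assumption: each temp table holds the deduplicated uids of one shard,
# so UNION ALL of the shards contains every uid exactly once.

shards=5
tmp_prefix='wb_ad_brand_industry_count_temp'

merge_sql() {
    local i
    echo "SELECT COUNT(*) AS uid_cnt"
    echo "FROM ("
    for ((i = 1; i <= shards; i++)); do
        echo "    SELECT uid FROM ${tmp_prefix}${i}"
        # UNION ALL goes between branches, not after the last one
        if ((i < shards)); then
            echo "    UNION ALL"
        fi
    done
    echo ") merged;"
}

merge_sql
```

UNION ALL is enough here, and a deduplicating UNION is unnecessary, because the shards are disjoint by construction: no uid can have two different remainders modulo 5.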
