Using Hive Bucketed Tables

Every table or partition can be further organized into buckets, which are simply a finer-grained division of the data.

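Bucketing hashes the specified column, then takes the remainder of the hash value divided by the number of buckets to decide which bucket each record is stored in.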

Formula: whichbucket = hash(columnvalue) % numberofbuckets

In words: target bucket = hash(column value) % number of buckets
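The formula above can be sketched in a few lines of Python. This is only an illustration: `java_style_hash` is a hypothetical stand-in (a 31-based rolling hash over the UTF-8 bytes), and the hash Hive actually applies depends on the column type and the Hive version, so the bucket numbers printed here are not guaranteed to match a real table.

```python
def java_style_hash(value: str) -> int:
    """Hypothetical stand-in hash: 31-based rolling hash over UTF-8 bytes,
    wrapped to a signed 32-bit integer."""
    h = 0
    for byte in value.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed.
    return h - 0x100000000 if h >= 0x80000000 else h


def which_bucket(column_value: str, number_of_buckets: int) -> int:
    """Apply whichbucket = hash(columnvalue) % numberofbuckets."""
    # Clear the sign bit first so the remainder is never negative.
    return (java_style_hash(column_value) & 0x7FFFFFFF) % number_of_buckets


# Sample country codes are made up for illustration.
for country in ["CN", "US", "IN", "BR"]:
    print(country, "->", which_bucket(country, 42))
```

The point to notice is that the same column value always lands in the same bucket, so every row for one country ends up in one bucket file.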

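In principle, a Hive bucketed table keeps the amount of data in each bucket file roughly equal, so no data skew should occur.

That only holds when the data is independent of the business, though. Real business data is rarely evenly distributed over the bucketing column, so in practice data skew is almost certain to appear.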
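A small simulation shows why skew appears once the key distribution is uneven. The per-country row counts below are invented for illustration, and `bucket_of` again uses a hypothetical 31-based hash rather than Hive's real one.

```python
from collections import Counter

NUM_BUCKETS = 42


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hypothetical stand-in for hash(key) % num_buckets."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * 31 + byte) & 0x7FFFFFFF  # keep the value non-negative
    return h % num_buckets


# Made-up row counts: a few countries dominate the log volume.
rows_per_country = {
    "CN": 500_000,
    "US": 300_000,
    "IN": 150_000,
    "BR": 40_000,
    "FR": 10_000,
}

# All rows of one country hash to the same bucket.
rows_per_bucket = Counter()
for country, rows in rows_per_country.items():
    rows_per_bucket[bucket_of(country)] += rows

for bucket, rows in sorted(rows_per_bucket.items()):
    print(f"bucket {bucket:2d}: {rows} rows")
print("empty buckets:", NUM_BUCKETS - len(rows_per_bucket))
```

With only a handful of distinct keys, most of the 42 buckets stay empty and the largest bucket is at least as big as the largest country, whatever hash function is used.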

Summary: a bucketed table splits each batch of data written to the table at the file level.

Example: user logs from multiple countries

-- Syntax for creating a bucketed table
create external table buckets_table(
  col ...
)
comment 'this is the buckets_table table'
partitioned by (`dt` string)
clustered by (col1)
[sorted by (col2 [asc|desc])]
into 2 buckets
location '.....'

Example:

-- Create an external bucketed table
create external table user_install_status_buckets(
  `aid` string,
  `pkgname` string,
  `uptime` bigint,
  `type` int,
  `country` string,
  `gpcategory` string
)
comment 'this is the buckets_table table'
partitioned by (`dt` string)
clustered by (country)
sorted by (uptime desc)
into 42 buckets
location 'hdfs://ns1/user/mydir/hive/user_install_status_buckets';

-- Add the partitions
alter table user_install_status_buckets add if not exists
partition (dt='20141228') location '20141228'
partition (dt='20141117') location '20141117';

When data is inserted into the partitions, some countries inevitably have far more data than others.

-- Insert data into the partitions
insert overwrite table user_install_status_buckets partition (dt='20141228')
select aid, pkgname, uptime, `type`, country, gpcategory
from user_install_status_orc
where dt='20141228';

insert overwrite table user_install_status_buckets partition (dt='20141117')
select aid, pkgname, uptime, `type`, country, gpcategory
from user_install_status_orc
where dt='20141117';

Data files in the buckets: the file sizes vary widely, which clearly shows data skew.

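Sampling a bucketed table:

When the data volume is very large and processing the full data set is difficult, sampling becomes especially important.

Sampling lets you estimate and infer the characteristics of the whole from the sampled data. It is an economical and effective method widely used in scientific experiments, quality inspection, and social surveys.

The syntax for bucket sampling is: table_sample: tablesample (bucket x out of y [on colname])

When the bucketing column of the table and the sampling column are the same, sampling does not scan the whole table; it reads only the specified bucket file as input.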

select *
from user_install_status_buckets
tablesample(bucket 11 out of 84 on country);

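The statement above samples about half of the 11th bucket, since 84 is twice the table's 42 buckets. The split is made by the hash of country rather than row by row, so if the 11th bucket contains no second country, all of its records may be returned instead of half.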
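The relationship between "bucket 11 out of 84" and the 42 stored buckets can be checked with a short sketch. It assumes sampling keeps the rows whose hash remainder modulo y equals x - 1, and it uses a hypothetical 31-based hash (`hash31`) and made-up key values in place of Hive's real hash and real country codes.

```python
STORED_BUCKETS = 42  # from "into 42 buckets"
X, Y = 11, 84        # from "bucket 11 out of 84"


def hash31(key: str) -> int:
    """Hypothetical non-negative stand-in for Hive's hash."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * 31 + byte) & 0x7FFFFFFF
    return h


def stored_bucket(key: str) -> int:
    """Zero-based index of the bucket file a key is written to."""
    return hash31(key) % STORED_BUCKETS


def in_sample(key: str) -> bool:
    """Assumed sampling rule: keep keys with hash % Y == X - 1."""
    return hash31(key) % Y == X - 1


keys = [f"key{i}" for i in range(2000)]  # made-up key values
in_bucket_11 = [k for k in keys if stored_bucket(k) == X - 1]
sampled = [k for k in keys if in_sample(k)]

# Every sampled key lives in the 11th stored bucket (index 10).
print(all(stored_bucket(k) == X - 1 for k in sampled))
print(len(in_bucket_11), "keys in bucket 11,", len(sampled), "of them sampled")
```

Under this assumption each key in the 11th bucket is either fully sampled (remainder 10 modulo 84) or not sampled at all (remainder 52), which is why a bucket holding a single country cannot be cut in half.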
