Hive: Virtual Columns in Hive and an Example Use Case

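Since version 0.8, Hive has provided several virtual columns. They are rarely needed day to day,

but they are a great help when an upstream ETL step has produced cleaning anomalies (for example through a logic error), because they let you quickly locate the file that contains the bad data.

I ran into exactly this problem in practice: while cleaning logs, an upstream log-cleaning step left some columns of the data far too long,

and I needed to find the offending file quickly. This is where virtual columns come in.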

Hive's main virtual columns are the following:

input__file__name

block__offset__inside__file

row__offset__inside__block (disabled by default; a parameter must be set)

Note that every __ is two underscores.

What the three columns mean:

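input__file__name :

The name of the input file that the mapper task read the row from.

block__offset__inside__file :

The offset within the file: the current global file position or, for a block-compressed file, the file offset of the current block's first byte.

row__offset__inside__block : (disabled by default)

The row's offset within its block.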
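As a minimal sketch, the first two columns can be selected like ordinary columns without any extra setting. The table clickcube_mid and its regioncode field are borrowed from the scenario later in this post; the LIMIT is an assumption added only to keep the output short.

```sql
-- Show which file, and which position in that file, each row came from.
-- clickcube_mid / regioncode come from the scenario in this post;
-- substitute any Hive table and column.
SELECT input__file__name,
       block__offset__inside__file,
       regioncode
FROM clickcube_mid
LIMIT 10;
```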

The third column has to be enabled manually:

turn on the hive.exec.rowoffset option.

First check whether the option is already enabled. Connect with the beeline client:

beeline -u jdbc:hive2://bigdata6:10000 -n cloudera-scm
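Inside the beeline session, running SET with only the property name prints its current value rather than changing it. A minimal sketch of the check:

```sql
-- No "=value" part, so this only displays the current setting.
SET hive.exec.rowoffset;
```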

The option is off by default, so we need to turn it on.
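Turning it on for the current session might look like the sketch below. The SELECT is only an illustrative check that the third column is now available; it assumes the clickcube_mid table from the scenario in this post, and the LIMIT is arbitrary.

```sql
-- Enable the row-offset virtual column for this session only.
SET hive.exec.rowoffset=true;

-- All three virtual columns can now be selected together.
SELECT input__file__name,
       block__offset__inside__file,
       row__offset__inside__block
FROM clickcube_mid
LIMIT 10;
```

Because SET applies only to the session in which it runs, it has to be repeated in each new beeline connection that needs the column.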

A real-world scenario

The table clickcube_mid has a field regioncode, which describes the region that an IP address maps to. At present regioncode holds the raw value taken directly from the logs.

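One day an abnormal regioncode caused a Spark job to abort, and investigation confirmed that an invalid regioncode was the cause. To find the bad regioncode values and the files they came from, we can query the table with the virtual columns included.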
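The original query is not preserved here, so the following is only a sketch under stated assumptions. The post does not say what made regioncode invalid; based on the over-long columns mentioned earlier, the sketch treats a value as bad when it is NULL or longer than 20 characters. That threshold is hypothetical and should be replaced with the rule that matches your data.

```sql
-- Needed for row__offset__inside__block (off by default).
SET hive.exec.rowoffset=true;

-- Step 1: count suspicious rows per input file to see which files are affected.
-- ASSUMPTION: "bad" means NULL or longer than 20 characters (hypothetical rule).
SELECT input__file__name,
       count(*) AS bad_rows
FROM clickcube_mid
WHERE regioncode IS NULL
   OR length(regioncode) > 20
GROUP BY input__file__name;

-- Step 2: list the offending rows with their position inside each file.
SELECT input__file__name,
       block__offset__inside__file,
       row__offset__inside__block,
       regioncode
FROM clickcube_mid
WHERE regioncode IS NULL
   OR length(regioncode) > 20;
```

The first query narrows the problem down to specific files; the second gives the offsets needed to inspect the raw records in those files.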
