Using Waterdrop to import data into ClickHouse

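Assume our data is already stored in Hive. We need to read it from the Hive table, select the fields we care about or transform some of them, and write the resulting fields into a ClickHouse table.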

Hive schema

The table we store in Hive holds ordinary nginx access logs.

The Hive CREATE TABLE statement is as follows:

create table `nginx_msg_detail` (
  `hostname` string,
  `domain` string,
  `remote_addr` string,
  `request_time` float,
  `datetime` string,
  `url` string,
  `status` int,
  `data_size` int,
  `referer` string,
  `cookie_info` string,
  `user_agent` string,
  `minute` string
) partitioned by (`date` string, `hour` string)

ClickHouse schema

Our ClickHouse CREATE TABLE statement is as follows; the table is partitioned by day.

CREATE TABLE cms.cms_msg (
  date Date,
  datetime DateTime,
  url String,
  request_time Float32,
  status String,
  hostname String,
  domain String,
  remote_addr String,
  data_size Int32
) ENGINE = MergeTree
PARTITION BY date
ORDER BY (date, hostname)
SETTINGS index_granularity = 16384

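Waterdrop with ClickHouse

Next, we use Waterdrop to write the data in Hive into ClickHouse.

Waterdrop runs the import on the Spark engine (Spark SQL).

Waterdrop pipeline

We only need to write one Waterdrop pipeline configuration file to complete the import.

The configuration file has four sections: spark, input, filter, and output.

Create the configuration file under Waterdrop's config directory: vim config/batch.conf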

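# Spark submit parameters; can be configured manually
spark {
}
# Hive query statement (table_name is the Spark temporary table; any name will do)
input {
}
# Filters out input fields that should not be written to ClickHouse (may be left empty when nothing is filtered)
filter {
}
# ClickHouse parameter settings
output {
}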
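For reference, a filled-in version of the four sections might look like the sketch below, written from the shell rather than in an editor. The plugin names (`hive`, `remove`, `clickhouse`) and their keys follow Waterdrop 1.x conventions as far as I know. The Spark sizing, the ClickHouse host, the credentials, and the temporary table name `nginx_msg` are placeholder assumptions, not values from the original setup, and type compatibility between the Hive and ClickHouse columns is not checked here.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical config/batch.conf for the Hive -> ClickHouse import.
# Host, credentials, Spark sizing and the temp table name are placeholder assumptions.
mkdir -p config
cat > config/batch.conf <<'EOF'
spark {
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  # lets Spark SQL read Hive tables
  spark.sql.catalogImplementation = "hive"
}
input {
  hive {
    pre_sql = "select * from nginx_msg_detail"
    table_name = "nginx_msg"
  }
}
filter {
  remove {
    # Hive columns that have no counterpart in cms.cms_msg
    source_field = ["referer", "cookie_info", "user_agent", "minute", "hour"]
  }
}
output {
  clickhouse {
    host = "your.clickhouse.host:8123"
    database = "cms"
    table = "cms_msg"
    fields = ["date", "datetime", "url", "request_time", "status", "hostname", "domain", "remote_addr", "data_size"]
    username = "username"
    password = "password"
  }
}
EOF
```

The `fields` list mirrors the columns of `cms.cms_msg`, and the `remove` filter drops the Hive columns that the ClickHouse table does not have.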

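The output section also accepts some optional parameters:

clickhouse.socket_timeout = 60000  # socket timeout, in milliseconds

bulk_size = 20000  # batch size; defaults to 20,000 rows and can be raised as needed

retry_codes = [209, 210]  # ClickHouse error codes that trigger a retry

retry = 3  # number of retries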

Run Waterdrop with the configuration file specified to write the data into ClickHouse. Here we use local mode as an example.

./bin/start-waterdrop.sh --config config/batch.conf -e client -m 'local[2]'

Running Waterdrop on a YARN cluster
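To run on YARN, the same launcher can presumably be pointed at a YARN master instead of `local[2]`. The sketch below reuses the flags from the local command and assumes that `-m` is passed through to Spark as the master and `-e` as the deploy mode. It prints the command and only executes it when the launcher script is present.

```shell
#!/bin/sh
# Sketch: reuse the local-mode flags, swapping the master for YARN.
# Assumption: -m maps to Spark's master and -e to its deploy mode (client or cluster).
CONFIG="config/batch.conf"
DEPLOY_MODE="client"
CMD="./bin/start-waterdrop.sh --config $CONFIG -e $DEPLOY_MODE -m yarn"
echo "$CMD"
# Execute only when run from the Waterdrop install directory
if [ -x ./bin/start-waterdrop.sh ]; then
  $CMD
fi
```

Setting `DEPLOY_MODE` to `cluster` would run the driver inside the cluster instead of on the submitting machine, provided the configuration file is reachable from there.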
