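Notes on Using Sqoop in CDH

CDH ships with two versions of the Sqoop component.

Here we choose version 1.4.6, which is Sqoop 1. Version 1.99.5 is Sqoop 2, which is still half-finished: it does not support importing from a relational database into Hive or HBase, so it is not recommended.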

sqoop list-databases --connect jdbc:mysql: --username root --password 111111

sqoop eval --connect jdbc:mysql: --username root --password 111111 \
--query "select xi.*, jing.name, wang.latitude, wang.longitude \
from xi, jing, wang \
where xi.id=jing.foreignid and wang.id=xi.id and xi.date>='2015-09-01' and xi.date<='2015-10-01'"

Once the Sqoop statements above have run, we can confirm that Sqoop itself works and that its connection to MySQL is fine.

sqoop import --connect jdbc:mysql: --username root --password 111111 \
--query "select xi.*, jing.name, wang.latitude, wang.longitude \
from xi, jing, wang \
where xi.id=jing.foreignid and wang.id=xi.id and xi.date>='2015-09-01' and xi.date<='2015-10-01' \
and \$CONDITIONS" \
--split-by date --hive-import -m 5 \
--target-dir /user/hive/warehouse/anqi_wang \
--hive-table anqi_wang

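Note:

Because importing from MySQL into Hive with Sqoop requires a target-dir, the imported table is an ordinary (managed) table and cannot be an external table.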

The following is a brief excerpt of what Sqoop does while the job runs:

BoundingValsQuery: select min(date), max(date) from (select xi.*, jing.name,wang.latitude,wang.longitude from xi ,jing, wang where xi.id=jing.foreignid and wang.id=xi.id and xi.date>='2015-09-01' and xi.date<='2015-10-01' and (1 = 1) ) as t1
15/10/13 13:11:47 INFO mapreduce.JobSubmitter: number of splits:5
15/10/12 13:40:28 INFO mapreduce.Job: map 0% reduce 0%
15/10/12 13:40:39 INFO mapreduce.Job: map 20% reduce 0%
15/10/12 13:40:40 INFO mapreduce.Job: map 40% reduce 0%
15/10/12 13:40:47 INFO mapreduce.Job: map 60% reduce 0%
15/10/12 13:40:48 INFO mapreduce.Job: map 80% reduce 0%
15/10/12 13:40:52 INFO mapreduce.Job: map 100% reduce 0%

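As the log shows, once --split-by is set, the job is split on that column: Sqoop first queries the column's minimum and maximum, then divides that range into as many splits as -m specifies (here -m 5; without -m the default is 4 splits). In testing, Sqoop handled this fairly complex SQL statement well.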
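To make the splitting more concrete, here is a minimal sketch in plain shell of how a min/max range can be cut into -m contiguous ranges. It is only an approximation under stated assumptions: the split column is treated as an integer, and split_ranges is a hypothetical helper written for this note, not part of Sqoop. Sqoop's own splitter for date columns follows the same min/max idea but is not reproduced here.

```shell
# Hypothetical helper (not a Sqoop command): cut the inclusive integer range
# [min, max] into n contiguous ranges, one per map task.
split_ranges() {
  min=$1; max=$2; n=$3
  total=$(( max - min + 1 ))
  lo=$min
  i=0
  while [ "$i" -lt "$n" ]; do
    size=$(( total / n ))
    # Spread the remainder over the first ranges so sizes differ by at most 1.
    if [ "$i" -lt $(( total % n )) ]; then size=$(( size + 1 )); fi
    if [ "$size" -gt 0 ]; then
      echo "$lo $(( lo + size - 1 ))"
      lo=$(( lo + size ))
    fi
    i=$(( i + 1 ))
  done
}

# Five ranges, as with -m 5: prints "1 20" through "81 100".
split_ranges 1 100 5
```

Each printed line is the lower and upper bound that one map task would read. The ranges are equal in width, not in row count, which is why the choice of split column matters.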

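After the import, the MySQL decimal column turns out to be a double column in Hive. To control this, specify the type mapping at import time with --map-column-hive, as shown below: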

sqoop import \
--connect jdbc:mysql: --username anqi --password anqi_mima \
--query "select * from xi where date>='2015-09-16' and date<='2015-10-01' \
and \$CONDITIONS" \
--split-by date --hive-import -m 5 \
--map-column-hive cost="decimal",date="date" \
--target-dir /user/hive/warehouse/xi \
--hive-table xi

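The command above runs successfully. However, when the Hive column type is set to plain decimal, going from MySQL [decimal(12,2)] --> Hive [decimal] loses the fractional digits after import. A likely cause is that a Hive decimal declared without precision and scale defaults to decimal(10,0), which keeps no digits after the decimal point.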
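The effect of a scale of 0 can be illustrated without Hive at all. The sketch below assumes that the bare decimal behaves like decimal(10,0); to_scale is a hypothetical helper that uses the shell's printf to mimic keeping a fixed number of fractional digits, and its rounding is not guaranteed to match Hive's in every edge case.

```shell
# Assumption: a bare Hive DECIMAL behaves like DECIMAL(10,0), i.e. scale 0.
# Force the C locale so that "." is always the decimal separator.
LC_ALL=C
export LC_ALL

# Hypothetical helper: $1 = value, $2 = number of fractional digits to keep.
to_scale() {
  printf "%.${2}f\n" "$1"
}

to_scale 1234.56 0   # scale 0: the fractional part is gone
to_scale 1234.56 2   # scale 2: the value is preserved
```

The first call prints 1235 and the second prints 1234.56, which mirrors what happens when a decimal(12,2) value lands in a column with no fractional digits.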

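In practice, a production system will often import data periodically from the business relational databases into Hadoop and run offline analysis once the data is in the warehouse. Re-importing everything each time is not feasible, so this is where incremental import comes in.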

sqoop import \
--connect jdbc:mysql: \
--username root \
--password 123456 \
--query "select order_id, name from order_table where \$CONDITIONS" \
--target-dir /user/root/orders_all \
--split-by order_id \
-m 6 \
--incremental append \
--check-column order_id \
--last-value 5201314
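Assuming the command above runs in Sqoop's append-style incremental mode, where only rows whose check column is greater than --last-value are imported, the selection logic can be sketched in plain shell. The sample rows are made up for illustration; nothing here talks to a database.

```shell
# Made-up sample rows in the form order_id,name.
rows='5201313,old-a
5201314,old-b
5201315,new-c
5201320,new-d'

last_value=5201314

# Keep only rows whose check column (field 1) is strictly greater than last_value.
new_rows=$(printf '%s\n' "$rows" | awk -F, -v last="$last_value" '$1 + 0 > last + 0')

# The largest imported value is what the next run should pass as --last-value.
next_last=$(printf '%s\n' "$new_rows" |
  awk -F, 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { print max }')

printf '%s\n' "$new_rows"
echo "next --last-value: $next_last"
```

Only the two rows with order_id 5201315 and 5201320 pass the filter, and 5201320 becomes the value to remember for the next run.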

Notes on the key parameters:

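This mode requires the source table to have a time column. You give Sqoop a timestamp, and it imports the rows modified from that timestamp onward into Hadoop (here, HDFS). Because an order's status may change later, and the timestamp in its time column changes with it, Sqoop will import the same order again after the status change. To handle this, we can set the merge-key parameter to order_id, which tells Sqoop to merge the new records with the existing ones.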
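The merge behaviour described here can be sketched in plain shell: for each key, the newer record replaces the older one. The data and column layout below are made up for illustration, and the awk one-liner only mimics the idea of merging on a key; it is not how Sqoop implements it.

```shell
# Made-up rows in the form order_id,status,time.
old='1,created,2014-11-09 20:00:00
2,created,2014-11-09 20:30:00'
new='2,paid,2014-11-09 21:05:00
3,created,2014-11-09 21:10:00'

# Feed the old rows first, then the new ones; for each order_id (field 1)
# the last row seen wins, so an updated order replaces its earlier version.
merged=$(printf '%s\n%s\n' "$old" "$new" |
  awk -F, '{ row[$1] = $0 } END { for (k in row) print row[k] }' |
  sort -t, -k1,1n)

printf '%s\n' "$merged"
```

The result has three rows: order 1 unchanged, order 2 in its updated "paid" state, and the newly added order 3.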

Incrementally import into HDFS the rows whose time column is greater than or equal to the threshold:

# Remember the --last-value date below!
sqoop import \
--connect jdbc:mysql: 20.160:3306/testdb \
--username root \
--password transwarp \
--query "select order_id, name from order_table where \$CONDITIONS" \
--target-dir /user/root/order_all \
--split-by order_id \
-m 4 \
--incremental lastmodified \
--merge-key order_id \
--check-column time \
--last-value "2014-11-09 21:00:00"

Notes on the key parameters:

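The -m parameter sets the number of map tasks used for the import, so specifying -m means the import runs in parallel. With a free-form --query, as in the commands here, we must then also specify --split-by to name the column whose value range Sqoop divides into slices, so that each slice goes to a different map task. Choose a column whose values are evenly distributed to avoid data skew.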
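Before settling on a split column, it can help to check how evenly its values fall into the ranges. The sketch below is a rough stand-alone check in plain shell: split_histogram is a hypothetical helper, the input values are made up, and equal-width ranges between the minimum and maximum are an assumption that only approximates how the splits are formed.

```shell
# Hypothetical helper: read one numeric value per line on stdin and report how
# many rows would land in each of n equal-width ranges between min and max.
split_histogram() {
  awk -v n="$1" '
    { v[NR] = $1 + 0
      if (NR == 1 || v[NR] < min) min = v[NR]
      if (NR == 1 || v[NR] > max) max = v[NR] }
    END {
      width = (max - min) / n
      for (i = 1; i <= NR; i++) {
        b = (width > 0) ? int((v[i] - min) / width) : 0
        if (b >= n) b = n - 1   # the maximum value belongs to the last range
        count[b]++
      }
      for (b = 0; b < n; b++) printf "split %d: %d rows\n", b, count[b] + 0
    }'
}

# Made-up values, clustered at the low end to mimic a skewed column.
printf '%s\n' 1 2 3 4 5 6 7 8 90 100 | split_histogram 4
```

With these values, split 0 receives 8 of the 10 rows, splits 1 and 2 receive none, and split 3 receives 2. A column with this kind of distribution would leave most map tasks idle.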
