Spark: too many small files after importing files into Hive

Environment:

ambari:2.6.1

spark 2.1

python 3.6

oracle 11.2

sqoop 1.4

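The files that sqoop collected into HDFS were imported into a Hive database. The import succeeded, but the Hive table ended up with a large number of small files, which severely slows down data loading in later analysis.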
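A likely cause, given that the insert uses dynamic partitions (area, timdate): each Spark task writes its own file for every Hive partition value it holds rows for, so the file count can approach tasks × partitions. The toy model below is only a plain-Python sketch of that behavior, not Spark itself; the function name `count_output_files`, the sample values, and the task count of 200 are made up for illustration.

```python
import random

def count_output_files(rows, num_tasks, seed=0):
    """Toy model: scatter rows over tasks at random; every task then writes
    one file per distinct (area, timdate) pair it received."""
    rng = random.Random(seed)
    files = set()
    for area, timdate in rows:
        task = rng.randrange(num_tasks)   # which task this row lands in
        files.add((task, area, timdate))  # one file per (task, partition value)
    return len(files)

# Hypothetical data: 3 areas x 2 months = 6 Hive partitions, 1000 rows each.
rows = [(area, timdate)
        for area in ("A", "B", "C")
        for timdate in ("201801", "201802")
        for _ in range(1000)]

many = count_output_files(rows, num_tasks=200)  # many tasks: far more files
one = count_output_files(rows, num_tasks=1)     # single task: one file per partition
print("files with 200 tasks:", many)
print("files with 1 task:", one)
```

With a single task the model yields exactly one file per Hive partition, which is the effect the solution relies on.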

Solution:

# Assumes `spark` is an existing SparkSession with Hive support and `tablename` is already defined.
sjtable = spark.sql("select * from " + tablename + "_tmp where att = '1e'")
datanum = sjtable.count()
# Fix the small-file problem: collapse the data into one partition so a single task does the writing
sjtable_tmp = sjtable.repartition(1).persist()
sjtable_tmp.createOrReplaceTempView(tablename + "_cpu_tmp")
spark.sql("insert into table " + tablename + "_cpusj partition(area,timdate) select lcn,pid,tim,tf,fee,bal,epid,etim,card_type,service_code,is_area_code,use_area_code \
,clea_day,current_timestamp,use_area_code as area,substr(tim,1,6) as timdate from " + tablename + "_cpu_tmp")
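Note that repartition(1) pushes all rows through a single task, which is fine for small tables but can become slow for large ones. One possible variation, not part of the original fix, is to derive the partition count from the data size. The helper below is a minimal sketch: the name `choose_num_partitions` and the 128 MB target (a common HDFS block size) are assumptions, and the size estimate has to come from somewhere else, for example the size of the source directory on HDFS.

```python
import math

def choose_num_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return a partition count so that each output file is roughly
    target_file_bytes in size, and never less than one partition."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Hypothetical sizes, for illustration only.
print(choose_num_partitions(50 * 1024 * 1024))   # small table: a single partition
print(choose_num_partitions(10 * 1024 ** 3))     # about 10 GiB of data

# Possible use in the script above (assumes an `estimated_bytes` value is available):
#   sjtable_tmp = sjtable.repartition(choose_num_partitions(estimated_bytes)).persist()
```

Because the insert uses dynamic partitions, each of the resulting tasks may still write one file per (area, timdate) value, so the count should be kept low when the table has many Hive partitions.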

[Figure: files after the change]
