Connecting to a Hive Database from Python: A Summary

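Big data analysis and applications constantly involve storing and computing over massive amounts of data. Traditional servers can no longer meet some of these computing needs, so big data platforms based on Hadoop/Spark have become widely used. This post describes how to read from a Hive database with Python. There are still a few pitfalls along the way, and I share the ones I ran into here.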

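The group has 20 servers (1 collection master node, 1 big data monitoring platform, 1 cluster master node, and 17 cluster nodes), 65 TB of HDFS disk, 3.5 TB of YARN memory, and so on. The project currently needs to analyze the group's household profile data: regional analysis based on housing estate, viewing preferences, household income, and similar data, along with analysis of program profiles and detailed housing estate data. I am used to doing analysis in R and Python, which is why I took the data-access approach shared here.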

First, set up the environment and the required libraries: sasl, thrift, thrift_sasl, and pyhive.

pip install sasl-0.2.1-cp36-cp36m-win_amd64.whl
pip install thrift -i
pip install thrift_sasl==0.3.0 -i
pip install pyhive -i
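After installing, it can help to confirm that the four modules are actually importable before trying to connect. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this illustration and is not part of any of the libraries.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Top-level module names used by the PyHive approach in this post.
    required = ["sasl", "thrift", "thrift_sasl", "pyhive"]
    missing = missing_modules(required)
    print("missing:", missing if missing else "none")
```

If the list is not empty, rerun the corresponding pip install command before going further.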

from pyhive import hive
import pandas as pd


# Read data from Hive into a pandas DataFrame
def select_pyhive(sql):
    # Open the Hive connection
    conn = hive.Connection(host='10.16.15.2', port=10000, username='hive', database='user')
    try:
        # pandas runs the query and fetches the rows through the connection
        df = pd.read_sql(sql, conn)
        return df
    finally:
        # Always release the connection, even if the query fails
        conn.close()


sql = "select * from user_huaxiang_wide_table"
df = select_pyhive(sql)

This retrieves about 1.93 million rows of household profile data, with 37 fields, from the Hive database.
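The try/finally cleanup in the function above can also be expressed with `contextlib.closing`, which closes both the cursor and the connection even when the query raises. The sketch below is only an illustration: `run_query` and `demo` are hypothetical names, and `sqlite3` stands in for Hive so the example runs without a cluster. It relies only on the connection and cursor having a `close()` method.

```python
import sqlite3
from contextlib import closing


def run_query(connect, sql):
    """Open a connection via connect(), run sql, and always close everything.

    Returns the column names and the fetched rows.
    """
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(sql)
            columns = [col[0] for col in cur.description]
            rows = cur.fetchall()
    return columns, rows


def demo():
    # sqlite3 is a stand-in here; it is not what the post connects to.
    return run_query(lambda: sqlite3.connect(":memory:"), "SELECT 1 AS a, 'x' AS b")
```

With PyHive, the `connect` argument would be a small factory such as `lambda: hive.Connection(host='10.16.15.2', port=10000, username='hive', database='user')`; that usage is an assumption and has not been run against the cluster described here.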

As you can see, the code is not very complicated, but two problems commonly come up during testing.

Fix 1:

Upgrade the thrift_sasl dependency to 0.3.0:

pip install thrift_sasl==0.3.0 -i
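Because the fix depends on a specific thrift_sasl version, it can be useful to check which version is actually installed in the current environment. This is a small sketch that assumes Python 3.8 or later (for `importlib.metadata`), which is newer than the cp36 wheel used above; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as passed to pip in this post.
    for name in ("thrift_sasl", "thrift", "sasl"):
        print(name, installed_version(name))
```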

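The second approach connects to the Hive database through impala. With very large data volumes, however, Python hangs, and I have not yet found a suitable way to solve this.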

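As before, start by setting up the environment and the required libraries: sasl, thrift, thrift_sasl, and impala (the impala module is distributed on PyPI as impyla).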

pip install sasl-0.2.1-cp36-cp36m-win_amd64.whl
pip install thrift -i
pip install thrift_sasl==0.2.0 -i
pip install impyla -i
pip install thriftpy -i

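from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd


# Fetch data from Hive into a pandas DataFrame
def select_hive(sql):
    # Open the Hive connection
    conn = connect(host='10.16.15.2', port=10000, auth_mechanism='plain', user='hive', password='user@123', database='user')
    cur = conn.cursor()
    try:
        # The query has to be executed before any rows can be fetched
        cur.execute(sql)
        # as_pandas reads the remaining rows from the cursor itself,
        # so a separate fetchall() beforehand would leave it nothing to read
        df = as_pandas(cur)
        return df
    finally:
        # Always release the connection, even if the query fails
        conn.close()


data = select_hive(sql='select * from user_huaxiang_wide_table limit 100')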

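The impala approach is also very convenient, but once the data volume reaches a certain size, the script stays stuck at the fetchall step and does not respond even after several hours.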
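One thing that may be worth trying for the hang described above is pulling rows in fixed-size batches with the DB-API `fetchmany` method instead of a single `fetchall`. This is not a verified fix for the problem in this post, only a sketch of the batching pattern. `iter_batches` and `demo` are hypothetical names, and `sqlite3` stands in for Hive so the example runs anywhere.

```python
import sqlite3


def iter_batches(cursor, batch_size=10000):
    """Yield lists of rows from an already executed DB-API cursor."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def demo():
    # sqlite3 is a stand-in for Hive so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE t (id INTEGER)")
        cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(25)])
        cur.execute("SELECT id FROM t ORDER BY id")
        # Return the size of each batch to show how the rows were split.
        return [len(batch) for batch in iter_batches(cur, batch_size=10)]
    finally:
        conn.close()
```

Each batch could be processed or written to disk before the next one is requested, which keeps memory use bounded and makes it visible whether rows are arriving at all. Whether this avoids the hang on a real cluster would need to be tested.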
