Reading Excel with PySpark

In day-to-day work, clients very often deliver data as Excel files, so sooner or later you have to deal with them.

My usual approach is to read the file with pandas and save it as Parquet:

# If you only need to read a single sheet
import pandas as pd
df = pd.read_excel("excel1.xlsx")
df.to_parquet("excel_etl/excel1.parquet")

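# If one workbook contains multiple sheets
import pandas as pd
xl = pd.ExcelFile("多sheetexcel.xlsx")
sheets = xl.sheet_names
for sheet in sheets:
    print(sheet)
    df = xl.parse(sheet)
    # Include the sheet name in the path so each sheet gets its own file
    df.to_parquet(f"excel_etl/{sheet}.parquet")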
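Sheet names can contain spaces or characters that are awkward in file paths, so it may help to normalize them before building the Parquet path. The following is a minimal sketch: the helper name `sheet_to_parquet_path` and its replacement rule are assumptions made for illustration, not part of the original workflow.

```python
import re
from pathlib import Path


def sheet_to_parquet_path(sheet_name: str, out_dir: str = "excel_etl") -> Path:
    """Hypothetical helper: map a sheet name to a Parquet path inside out_dir."""
    # Replace each run of characters other than word characters, dots and hyphens with "_"
    safe = re.sub(r"[^\w.-]+", "_", sheet_name.strip()).strip("_")
    # Fall back to a fixed name if nothing usable is left
    return Path(out_dir) / f"{safe or 'sheet'}.parquet"


if __name__ == "__main__":
    for name in ["Sheet1", "2019 Q1/Q2", "  sales data  "]:
        print(sheet_to_parquet_path(name))
```

Note that two different sheet names can map to the same path under this rule (for example `a b` and `a/b`), so a real pipeline would need to check for collisions.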

If all the sheets share the same format, PySpark can easily read all the data in one go:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()

# Reading the whole directory is enough
df = spark.read.parquet("excel_etl")
# A glob pattern can also be used to read only the Parquet files you want
# df = spark.read.parquet("excel_etl/*.parquet")
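The commented-out path above relies on glob-style wildcards rather than full regular expressions. As a rough illustration of how `*` selects file names, here is a minimal sketch using Python's standard `fnmatch` module. The file names are made up, and Spark's own path expansion (handled by Hadoop) may differ in details.

```python
import fnmatch
from typing import List


def select_files(names: List[str], pattern: str) -> List[str]:
    """Return the names that match a glob-style pattern such as "*.parquet"."""
    return fnmatch.filter(names, pattern)


if __name__ == "__main__":
    # Made-up directory listing for illustration
    listing = ["excel1.parquet", "Sheet1.parquet", "_SUCCESS", "notes.txt"]
    print(select_files(listing, "*.parquet"))
```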

You can also read Excel files with a Spark Excel plugin. Two are introduced here, both built on Apache POI:

from pyspark import SparkConf
conf = SparkConf() \
    .set("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.11.1") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6g") \
    .set("spark.driver.memory", "6g") \
    .set("spark.executor.memory", "6g")
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()
xlsx = "online retail.xlsx"
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "true") \
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") \
    .option("maxRowsInMemory", 20) \
    .option("excerptSize", 10) \
    .load(xlsx)
df.printSchema()

root
|-- invoiceno: double (nullable = true)
|-- stockcode: string (nullable = true)
|-- description: string (nullable = true)
|-- quantity: double (nullable = true)
|-- invoicedate: timestamp (nullable = true)
|-- unitprice: double (nullable = true)
|-- customerid: double (nullable = true)
|-- country: string (nullable = true)

# Some inferred column types are wrong and need adjusting
import pyspark.sql.functions as F
df = df.withColumn("invoiceno", F.col("invoiceno").cast("string"))
df.write.parquet("online retail", mode="overwrite")

# With too many rows in the Excel file this test failed; with fewer rows it worked fine
# It is also very resource-hungry, and I have not tested it on a cluster
# Note: part of the "read.spark.******mode" option name is masked in the source and is kept as-is
from pyspark import SparkConf
conf = SparkConf() \
    .set("spark.jars.packages", "com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6g") \
    .set("spark.driver.memory", "6g") \
    .set("spark.executor.memory", "6g")
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()
df = spark.read.format("org.zuinnote.spark.office.excel") \
    .option("read.locale.bcp47", "zh-hans") \
    .option("read.spark.******mode", True) \
    .option("read.header.read", True) \
    .load("online retail.xlsx")
df.printSchema()
df.write.parquet("online retail", mode="overwrite")
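Both plugin examples repeat the same Spark settings and differ only in the package coordinate. One way to avoid the duplication is to build the key/value pairs in a single place. This is a minimal sketch: the helper name `excel_conf_pairs` and its `memory` parameter are assumptions for illustration.

```python
from typing import List, Tuple


def excel_conf_pairs(package: str, memory: str = "6g") -> List[Tuple[str, str]]:
    """Hypothetical helper: the settings shared by both plugin examples."""
    return [
        ("spark.jars.packages", package),
        ("spark.sql.shuffle.partitions", "4"),
        ("spark.sql.execution.arrow.enabled", "true"),
        ("spark.driver.maxResultSize", memory),
        ("spark.driver.memory", memory),
        ("spark.executor.memory", memory),
    ]


if __name__ == "__main__":
    for key, value in excel_conf_pairs("com.crealytics:spark-excel_2.11:0.11.1"):
        print(f"{key}={value}")
```

The resulting list can be passed to `SparkConf().setAll(...)` before building the session, so switching plugins only means changing the package string.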
