(1) By default the write produces a pile of small files; repartition the data first and specify the number of partitions directly:
spark.sql("select *,row_number() over(partition by depid order by salary) rownum from emp ").repartition(2).write.parquet("hdfs:///user/cuixiaojie/employeerepartition")
(2) By default the write also produces a pile of small files; here the output is instead partitioned directly by the value of a column:
spark.sql("select *,row_number() over(partition by depid order by salary) rownum from emp ").write.partitionby(3).parquet("hdfs:///user/cuixiaojie/employee")
val data = sc.textFile("/home/shiyanlou/uber") // the data file is at the end
data.first
val hd = data.first()
val datafiltered = data.filter(line => line != hd)
datafiltered.count
case class Uber(dispatching_base_number: String, date: String, active_vehicles: Int, trips: Int)
val df = datafiltered.map(x => x.split(",")).map(x => Uber(x(0), x(1), x(2).toInt, x(3).toInt)).toDF()
df.registerTempTable("uber")
val pas = (s: String) => ... // body missing in the original notes
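Since the body is not given, here is a minimal sketch, assuming the dates in the uber file look like "1/1/2015" and that ppa is meant to reduce each date to its weekday for the grouping below:
import java.text.SimpleDateFormat
// hypothetical implementation: parse "M/d/yyyy" and return the weekday name, e.g. "Thursday"
val pas = (s: String) => new SimpleDateFormat("EEEE").format(new SimpleDateFormat("M/d/yyyy").parse(s))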
sqlContext.udf.register("ppa", pas)
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt, sum(trips) as cnt from uber group by dispatching_base_number, ppa(date) order by cnt desc")
rep.collect
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt, sum(active_vehicles) as cnt from uber group by dispatching_base_number, ppa(date) order by cnt desc")
rep.collect
Use a window function to open a window per department, sort salaries in ascending order, and then sum each person's salary with the salaries of the two people ranked just before them.
employee.json is as follows:
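The actual file contents are not reproduced here; a hypothetical sample in the JSON-lines layout spark.read.json expects (depid and salary come from the queries below, the name field is assumed):
{"name":"Tom","depid":1,"salary":3000}
{"name":"Jack","depid":1,"salary":4000}
{"name":"Lucy","depid":2,"salary":5000}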
spark.read.json("hdfs:///user/cuixiaojie/employee.json").createOrReplaceTempView("emp")
spark.sql("select *,lag(salary,1) over(partition by depid order by salary) as lag1,lag(salary,2) over(partition by depid order by salary) as lag2 from emp ").createOrReplaceTempView("employeelag1lag2")
spark.sql("select *,nvl(salary,0)+nvl(lag1,0)+nvl(lag2,0) sumsalary from employeelag1lag2").show
spark-shell --master yarn-client --executor-memory 1g --num-executors 1 --executor-cores 1
In yarn-client mode: containers running consumed = en + 1; memory used = en * (em + 1) + 1; vcores used = en * ec + 1 (where en is --num-executors, em is --executor-memory in GB, and ec is --executor-cores).
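For the spark-shell command above, en = 1, em = 1 GB and ec = 1, so this works out to: containers running = 1 + 1 = 2, memory used = 1 * (1 + 1) + 1 = 3 GB, vcores used = 1 * 1 + 1 = 2. The extra +1 terms presumably come from the YARN ApplicationMaster container and the per-executor memory overhead.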