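Text Mining and Machine Learning with Spark: An Implementation

Format of the given file

① Use the Spark APIs to process the scraped data into a structured table.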

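import org.ansj.recognition.impl.StopRecognition
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val filter = new StopRecognition()
filter.insertStopNatures("f","b","p","d","w","v","c","u") // filter out punctuation and the other listed word natures
val sparkcontext = new SparkContext(conf)
val sqlcontext = new SQLContext(sparkcontext)
val url = "c:\\users\\shuangmm\\desktop\\data\\jobarea=010000&industrytype=01.json"
val datadf = sqlcontext.read.format("json")
  .option("header","true")
  .option("inferSchema", true.toString) // automatically infer the data type of each column
  .load(url) //.show(10) // path to the file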

② Rank categories under a few conditions (here, the average salary of big data engineers with 1-3 years of work experience).

val pay_job = datadf.select("slry","title","expr")
val slry_rank = pay_job.rdd
.map .filter //.foreach(println)

Result after splitting

// val des = x.split("年").last//(3)+x.split(" ")(4)//+x.lastOption

(mean, "大資料工程師") // label: "big data engineer"

} catch

}}.sortBy(_._1, false, 1).foreach(println)

Ranking result
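The map and filter bodies are not shown in full above, so the following is only a minimal plain-Scala sketch of the same idea: keep the postings that match a title keyword and an experience value, then average their salaries. The `Job` case class, the `midpoint` and `meanSalary` helpers, and the assumed salary format `"low-high"` are all hypothetical; the real `slry` and `expr` fields may look different.

```scala
object SalarySketch {
  // Hypothetical record shape; field names mirror the slry/title/expr columns.
  final case class Job(slry: String, title: String, expr: String)

  // Assumed salary format: "low-high", e.g. "10-15".
  // Returns the midpoint, or None when the text does not match that format.
  def midpoint(slry: String): Option[Double] =
    slry.split("-") match {
      case Array(lo, hi) =>
        scala.util.Try((lo.trim.toDouble + hi.trim.toDouble) / 2).toOption
      case _ => None
    }

  // Mean midpoint salary over jobs whose title contains `keyword`
  // and whose experience field equals `expr`. Unparseable salaries are skipped.
  def meanSalary(jobs: Seq[Job], keyword: String, expr: String): Option[Double] = {
    val values = jobs
      .filter(j => j.title.contains(keyword) && j.expr == expr)
      .flatMap(j => midpoint(j.slry))
    if (values.isEmpty) None else Some(values.sum / values.size)
  }
}
```

Returning `Option` keeps malformed salary strings from throwing, which is roughly the role the try/catch plays in the original fragment.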

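③ Clean up the dscr description field, then use an algorithm to label the job-requirement records.

Overall approach: looking at the file structure, there is a cate column that records the job category and a dscr column that describes the job. cate is used as the label and dscr as the features for training.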
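To make the label side of this concrete, here is a small plain-Scala sketch of what indexing a string label column amounts to: the most frequent label gets index 0, the next gets 1, and so on, and the same ordered list maps predicted indices back to strings. The object and function names are made up for illustration, and the alphabetical tie-break is an assumption chosen to keep the result deterministic.

```scala
object LabelIndexSketch {
  // Order labels by descending frequency; ties are broken alphabetically.
  // The position of a label in the result is its numeric index.
  def fitLabels(labels: Seq[String]): Vector[String] =
    labels.groupBy(identity).toVector
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)

  // String label -> numeric index (None for a label never seen while fitting).
  def encode(ordered: Vector[String], label: String): Option[Int] = {
    val i = ordered.indexOf(label)
    if (i >= 0) Some(i) else None
  }

  // Numeric index -> string label (None when the index is out of range).
  def decode(ordered: Vector[String], index: Int): Option[String] =
    ordered.lift(index)
}
```

In the Spark code that follows, `StringIndexer` and `IndexToString` play the encode and decode roles respectively.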

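import java.io.File
import scala.io.Source
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, IndexToString, StopWordsRemover, StringIndexer}
val strstop = Source.fromFile(new File("c:\\users\\shuangmm\\desktop\\stopword.txt"))("utf-8")
  .getLines().toArray // stop-word list, one word per line
val dscr = datadf.select("cate","dscr")
val index = new StringIndexer() // turns the cate strings into numeric label indices
  .setInputCol("cate")
  .setOutputCol("cateindex_string")
  .fit(dscr)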

// wordsdata is the tokenized DataFrame with a "words" column; the step that builds it is not part of this listing
val remover = new StopWordsRemover()
  .setStopWords(strstop)
  .setInputCol("words")
  .setOutputCol("filter")
val re = remover.transform(wordsdata) //.show(5)
val hashingtf =
  new HashingTF().setInputCol("filter").setOutputCol("rawfeatures").setNumFeatures(1000)
val featurizeddata = hashingtf.transform(re)
val idf = new IDF().setInputCol("rawfeatures").setOutputCol("features")
val idfmodel = idf.fit(featurizeddata)
// val rescaleddata = idfmodel.transform(featurizeddata)

val Array(trainingdata, testdata) = featurizeddata.randomSplit(Array(0.7, 0.3))
val rf = new RandomForestClassifier()
  .setLabelCol("cateindex_string")
  .setFeaturesCol("features")
  .setNumTrees(15)
val labelconverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedlabel")
  .setLabels(index.labels)

val pipeline = new Pipeline()
  .setStages(Array(index, idfmodel, rf, labelconverter))
// trainingdata.show(5)
// Train the model. This also runs the indexers.
val model = pipeline.fit(trainingdata)
// Make predictions.
val predictions = model.transform(testdata)
// Select example rows to display.
predictions.select("predictedlabel", "filter").show(100)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("cateindex_string")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("test error = " + (1.0 - accuracy))
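For readers who want to see what the HashingTF and IDF stages compute, the sketch below redoes the arithmetic in plain Scala on tiny inputs. It is an illustration under stated assumptions, not Spark's implementation: `String.hashCode` stands in for the hash function (Spark's own hash function differs), and the smoothed weight `log((m + 1) / (df + 1))` follows the IDF formula as given in the Spark MLlib feature documentation. All names here are hypothetical.

```scala
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures).
  // Assumption: String.hashCode is a stand-in for the real hash function.
  def bucket(term: String, numFeatures: Int): Int =
    Math.floorMod(term.hashCode, numFeatures)

  // Hashed term frequencies for one tokenized document.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(bucket(_, numFeatures)).map { case (b, ts) => b -> ts.size.toDouble }

  // Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1)),
  // where m is the number of documents and df the number that contain the bucket.
  def idf(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = docs.size
    docs.flatMap(_.keys).groupBy(identity).map { case (b, hits) =>
      b -> math.log((m + 1.0) / (hits.size + 1.0))
    }
  }

  // Rescale one document's term frequencies by the IDF weights.
  def tfIdf(tf: Map[Int, Double], idfWeights: Map[Int, Double]): Map[Int, Double] =
    tf.map { case (b, f) => b -> f * idfWeights.getOrElse(b, 0.0) }
}
```

With only 1000 buckets, as in the listing above, different words can land in the same bucket, so a larger feature count may be worth trying when the vocabulary is big.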

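Results

The error rate is still fairly high, at around 0.87. The main reason is that the label text was not preprocessed: for example, junior software engineer and senior software engineer could really be merged into one category, but no such merging was done here. This experiment was mainly about processing and analyzing text with Spark, including building the model and related work. Most of the problems I ran into were in text segmentation and in converting the text into features, where type mismatches kept coming up and I could not get the conversion between Vector and String right. I eventually switched to a different approach after finding how classification models are handled on the official Spark site, namely building a Pipeline. The biggest takeaway: read the official documentation more in future!!! It helps you avoid pitfalls. Time was rather short.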
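The label merging mentioned above was not done in this experiment. As a rough sketch of how it might look, the plain-Scala helper below strips a seniority prefix so that junior and senior variants of the same job fall into one category before indexing. The prefix list and the example labels are hypothetical; the real cate values would need to be inspected first.

```scala
object LabelMergeSketch {
  // Hypothetical seniority prefixes; the actual cate values may use other words.
  val defaultPrefixes: Seq[String] = Seq("Junior", "Senior")

  // Drop the first matching prefix and surrounding whitespace;
  // labels without a known prefix are returned trimmed but otherwise unchanged.
  def normalize(label: String, prefixes: Seq[String] = defaultPrefixes): String = {
    val trimmed = label.trim
    prefixes.find(p => trimmed.startsWith(p)) match {
      case Some(p) => trimmed.drop(p.length).trim
      case None    => trimmed
    }
  }
}
```

Applied to the cate column before the StringIndexer step, a helper like this would reduce the number of classes, which may lower the error rate, though that has not been measured here.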

