Kaggle DC Competition Progress 2

TeamViewer interface

Open Xshell

teamviewer.gif

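Tips:

About this meeting:

Describe your abilities in algorithms / programming / editing, and which kind of work you would prefer to take on;

State how much time you can commit to the competition, and any important personal dates that fall within it;

My own answers, as an example:

I have no particular preference among the three areas, but since most of the team is not very familiar with the computing environment, I lean toward taking on the programming work. I will of course also take part in algorithm research and iteration;

I am available most evenings after 9:30, for part of the weekend, and during spare moments at work.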

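Previous work that may be useful in this competition:

Driver behavior scoring / user profile analysis / cleaning and calibration of raw GPS and g-sensor data / automated data reports, etc.

Later I will record GIFs of some simple data-processing operations for everyone.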

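My proposal:

By Saturday evening, everyone must complete one of the following tasks:

A reproducible case (**, algorithm reproducible), with a note on what we can borrow from it; a New York taxi case is also fine, as there is plenty of material on it;

Propose your own algorithm document or workflow; it does not need to be complete.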

Yesterday's quick experiment

Uploading the data

Note the column types: id string,lat string,lon double,status int,stamp string

# Shell: create the HDFS directory and upload the raw files
hadoop fs -mkdir -p /user/yyl/hoho
hadoop fs -put /root/temp/* /user/yyl/hoho/
# During the import, the data for days 7 and 13 turned out to be missing; cause unknown.

-- Hive: create the table, then load the uploaded files into it
create table if not exists trip_stat_hoho (id string, lat string, lon double, status int, stamp string)
row format delimited fields terminated by ',' lines terminated by '\n'
stored as textfile;
load data inpath '/user/yyl/hoho/*' into table trip_stat_hoho;

The same kind of table can also be created from Spark, pointing at an HDFS location:

sql(hivecontext,"create table if not exists trip_stat_hoho1 (id string,lat string,lon double,status int,stamp string ) row format delimited fields terminated by ',' location '/user/yyl/hoho/user'")
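The import step turned up two missing days (7 and 13). A check along the following lines could flag such gaps before loading. This is only a minimal sketch: the file naming scheme (`trip_DD.csv`) and the 1 to 15 day range are hypothetical, since the notes do not say how the raw files are named or how many days the data covers.

```python
import re

def missing_days(filenames, first_day, last_day):
    """Return the day numbers in [first_day, last_day] that have no matching file."""
    # Hypothetical naming scheme: the two digits right before the extension
    # are the day of the month, e.g. "trip_07.csv". Adjust to the real names.
    pattern = re.compile(r"(\d{2})\.\w+$")
    present = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            present.add(int(match.group(1)))
    return [d for d in range(first_day, last_day + 1) if d not in present]

# Illustrative listing only: days 1-15 with 7 and 13 absent.
files = [f"trip_{d:02d}.csv" for d in range(1, 16) if d not in (7, 13)]
print(missing_days(files, 1, 15))  # prints [7, 13]
```

In practice the file list would come from something like the output of `hadoop fs -ls` on the upload directory rather than from a generated list.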

Simple analysis

Ran a count in Spark:

Excluding one file stored on Dropbox, the data totals 43 GB

Nearly one billion records
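As a rough sanity check, the two figures above can be combined into an average record size. The sketch below assumes "nearly one billion" is rounded to exactly 10^9 records and that 43 GB means decimal gigabytes; both are assumptions, not figures from the count itself.

```python
def avg_bytes_per_record(total_bytes, record_count):
    """Average size of one record in bytes."""
    return total_bytes / record_count

total_bytes = 43 * 10**9   # assumption: 43 GB in decimal gigabytes
record_count = 10**9       # assumption: "nearly one billion" rounded to 1e9

print(avg_bytes_per_record(total_bytes, record_count))  # prints 43.0
```

A few dozen bytes per record seems consistent with a short comma-separated line of five fields (id, lat, lon, status, stamp), so the count and the file size at least do not contradict each other.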
