Implementing Grouped Top-N in Spark

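Many datasets contain a category field, and some features need the first or last few records within each category, for example for data visualization or for alerting on anomalous data. Grouped top-N is what those cases call for, so I used Spark aggregate functions together with sorting to implement a distributed top-N computation.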
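To make the goal concrete, here is a minimal sketch of grouped top-N on plain Scala collections, without Spark. The record shape `(category, value)`, the name `GroupedTopNSketch`, and the choice of descending order are assumptions made for illustration; they are not part of the job described in this post.

```scala
object GroupedTopNSketch {
  // Hypothetical input: a sequence of (category, value) pairs.
  // Returns, for every category, its n largest values in descending order.
  def topNPerGroup(rows: Seq[(String, Int)], n: Int): Map[String, Seq[Int]] =
    rows
      .groupBy(_._1) // group the records by category
      .map { case (category, members) =>
        // sort each group's values from largest to smallest and keep the first n
        category -> members.map(_._2).sorted(Ordering[Int].reverse).take(n)
      }
}
```

The Spark version follows the same idea, except that the grouping is done by an aggregate function and the per-group selection by a UDF.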

/**
 * Compute grouped top-N.
 * Created by administrator on 2019/11/20.
 */
object GroupTopN {

// Set the data schema
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val structType = StructType(Array(
  StructField("scene", StringType, true),
  StructField("cycle", StringType, true)
))

// Convert to a DataFrame
val test_data_df = spark.createDataFrame(test_data_rdd, structType)
test_data_df.createOrReplaceTempView("test_data_df")

// Concatenate the cycles of each scene
val scene_ws = spark.sql("select scene,concat_ws(',',collect_set(cycle)) as cycles from test_data_df group by scene")
scene_ws.count()
scene_ws.show()
scene_ws.createOrReplaceTempView("scene_ws")

/**
 * Parameter that sets the size of n; provisionally 1.
 */
val sum = 1

// Create a broadcast variable to broadcast the size of n
val broadcast = sc.broadcast(sum)

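/**
 * Define a UDF that returns the first n values within each group.
 */
When n is greater than 1, the selected values are concatenated into a single string. To get one value per row instead, use a column-to-row transformation; see my blog.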
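The UDF body is not shown here, so the following is only a sketch of what such a function might look like, written as plain Scala so it can be tried without a cluster. It assumes the `cycles` column is the comma-joined string produced by `concat_ws`, and that "first n" means the first n values in ascending string order. `TopNUdfSketch`, `firstN`, and `toRows` are hypothetical names.

```scala
object TopNUdfSketch {
  // Assumption: `cycles` is a comma-separated string and values are ranked
  // in ascending string order. Returns the first n values, re-joined with commas.
  def firstN(cycles: String, n: Int): String =
    cycles.split(",").map(_.trim).filter(_.nonEmpty).sorted.take(n).mkString(",")

  // Column-to-row analog: turn one (scene, "a,b,c") pair into one pair per value.
  def toRows(scene: String, joined: String): Seq[(String, String)] =
    joined.split(",").toSeq.filter(_.nonEmpty).map(value => (scene, value))
}
```

In a Spark job, a function like `firstN` could be wrapped with `org.apache.spark.sql.functions.udf`, with n read from the broadcast variable's `value`. The one-value-per-row form could be obtained with the built-in `explode` and `split` functions instead of `toRows`.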
