Computing an average with a Hive UDAF

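I have recently been working on a data migration project, moving the aggregation part from Kettle to a Hadoop cluster, which means writing a lot of aggregation scripts.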

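On a forum I found a cube-like UDAF written by a colleague at Alipay, but it threw errors when I ran it. There were several parts I could not follow and it had no comments, so I went back to basics and wrote my own. I had written UDFs before, so getting started was fairly quick.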

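Prerequisites:

1. Your UDAF class must extend org.apache.hadoop.hive.ql.exec.UDAF.

2. It must contain at least one inner class that implements org.apache.hadoop.hive.ql.exec.UDAFEvaluator.

3. The evaluator must implement the init() method.

4. It must have a method named iterate(), whose parameters are up to you. This method is the entry point of the function.

5. It must also provide terminatePartial, merge and terminate. Together with the two above, that makes five methods, each explained below.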


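import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {
  // The evaluator class name is illustrative; Hive only requires that it implements UDAFEvaluator.
  public static class MeanEvaluator implements UDAFEvaluator {
    // Partial aggregation state: running sum and row count.
    public static class PartialResult { double sum; long count; }
    private PartialResult pResult;
    @Override
    public void init() { pResult = null; }  // reset the internal state
    public boolean iterate(DoubleWritable value) {
      if (value == null) return true;  // skip NULL inputs
      if (pResult == null) pResult = new PartialResult();
      pResult.sum += value.get();
      pResult.count++;
      return true;
    }
    public PartialResult terminatePartial() { return pResult; }
    public boolean merge(PartialResult other) {
      if (other == null) return true;  // nothing to merge
      if (pResult == null) pResult = new PartialResult();
      pResult.sum += other.sum;
      pResult.count += other.count;  // add the partial count, not 1
      return true;
    }
    public DoubleWritable terminate() {
      if (pResult == null) return null;  // no rows were seen
      return new DoubleWritable(pResult.sum / pResult.count);
    }
  }
}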

init() is used for initialization. It usually clears the aggregation variables and resets the internal state.

iterate is the entry point of the function. The number and types of its parameters depend directly on what the UDAF computes.
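To illustrate how the iterate signature follows from what the function computes, here is a minimal sketch of a hypothetical weighted mean, which needs two arguments per row. It is plain Java with no Hive dependency; the class WeightedMeanSketch and its fields are assumptions made for illustration and are not part of the original code.

```java
// Hypothetical sketch: plain Java, not wired into Hive.
final class WeightedMeanSketch {
    private double weightedSum;  // running sum of value * weight
    private double totalWeight;  // running sum of weights

    // Two parameters, because a weighted mean needs a value and a weight per row.
    boolean iterate(Double value, Double weight) {
        if (value == null || weight == null) {
            return true; // skip rows with a NULL in either column
        }
        weightedSum += value * weight;
        totalWeight += weight;
        return true;
    }

    // Returns null when no usable rows were seen.
    Double terminate() {
        if (totalWeight == 0.0) {
            return null;
        }
        return weightedSum / totalWeight;
    }
}
```

A plain mean needs only one argument, as in the Mean class above; a different aggregate changes the parameter list, but the method name stays iterate.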

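terminatePartial is called when Hive needs a partial aggregation. Different data blocks are processed by different map tasks, and their results are then sent to the reduce side. Many computations can simply be applied again to the map output, for example a maximum. An average cannot, because the average of partial averages is generally not the overall average. The method must therefore return an object that encapsulates the current state of the aggregation (here the sum and the count), which is passed on to the reduce side.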
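The difference between a maximum and an average can be checked with a small plain-Java sketch. It has no Hive dependency, and the names PartialMeanSketch, Partial, aggregate, meanFromPartials and meanOfMeans are hypothetical, chosen only for this illustration.

```java
// Plain-Java sketch (no Hive dependency) of why a mean must ship (sum, count).
final class PartialMeanSketch {
    // Partial state that one map task would hand over.
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }
    }

    // Sum and count of a single block of values.
    static Partial aggregate(double[] block) {
        double sum = 0.0;
        for (double v : block) sum += v;
        return new Partial(sum, block.length);
    }

    // Correct: combine sums and counts, divide once at the end.
    static double meanFromPartials(Partial a, Partial b) {
        return (a.sum + b.sum) / (a.count + b.count);
    }

    // Incorrect in general: average the per-block averages.
    static double meanOfMeans(Partial a, Partial b) {
        return (a.sum / a.count + b.sum / b.count) / 2.0;
    }

    public static void main(String[] args) {
        Partial a = aggregate(new double[] {1.0, 2.0, 3.0}); // block mean 2.0
        Partial b = aggregate(new double[] {10.0});          // block mean 10.0
        System.out.println(meanFromPartials(a, b)); // 4.0, the true mean of 1, 2, 3, 10
        System.out.println(meanOfMeans(a, b));      // 6.0, which is wrong
    }
}
```

The two results agree only when every block has the same number of rows, which Hive does not guarantee.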

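merge is called to fold a partial aggregation produced by terminatePartial into the current state, so its parameter type must match the return type of terminatePartial.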

terminate is called when Hive performs the final aggregation, and it returns the result of the computation.
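The order in which the five methods are used can be sketched without a cluster. The following plain-Java simulation is an assumption about a typical two-stage run, not a description of Hive's actual scheduler: Double stands in for DoubleWritable, and LifecycleSketch, Evaluator and run are hypothetical names.

```java
// Plain-Java sketch of the call order; Double stands in for DoubleWritable.
final class LifecycleSketch {
    static final class PartialResult { double sum; long count; }

    // Mirrors the five evaluator methods without depending on Hive.
    static final class Evaluator {
        private PartialResult state;

        void init() { state = null; }

        boolean iterate(Double value) {
            if (value == null) return true;
            if (state == null) state = new PartialResult();
            state.sum += value;
            state.count++;
            return true;
        }

        PartialResult terminatePartial() { return state; }

        boolean merge(PartialResult other) {
            if (other == null) return true;
            if (state == null) state = new PartialResult();
            state.sum += other.sum;
            state.count += other.count;
            return true;
        }

        Double terminate() {
            if (state == null) return null;
            return state.sum / state.count;
        }
    }

    // Simulates one map-side evaluator per block and a single reduce-side evaluator.
    static Double run(double[][] blocks) {
        Evaluator reducer = new Evaluator();
        reducer.init();
        for (double[] block : blocks) {
            Evaluator mapper = new Evaluator();
            mapper.init();
            for (double v : block) mapper.iterate(v);  // map side: one call per row
            reducer.merge(mapper.terminatePartial());  // reduce side: one call per partial
        }
        return reducer.terminate();
    }

    public static void main(String[] args) {
        System.out.println(run(new double[][] {{1.0, 2.0, 3.0}, {10.0}})); // 4.0
    }
}
```

If merge incremented the count by one instead of adding other.count, the same input would give 16.0 / 2 = 8.0 rather than 4.0.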
