Time-series anomaly detection with Hivemall

Installing Hivemall

$ git clone
$ cd incubator-hivemall
$ bin/build.sh

Start Hive and load the required JAR

add jar /home/hadoop/incubator-hivemall/target/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
source /home/hadoop/incubator-hivemall/resources/ddl/define-all.hive;
create database twitter;
use twitter;

Create an external table and set the location of the external data

create external table timeseries (
  num int,
  value double
)
row format delimited fields terminated by '#'
stored as textfile
location '/dataset/twitter/timeseries';

Source data format

182.478
176.231
183.917
177.798
165.469
181.878
184.502
183.303
177.578
171.641

Import data format

1#182.478
2#176.231
3#183.917
4#177.798
5#165.469
6#181.878
7#184.502
8#183.303
9#177.578
10#171.641
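One way to get from the source format to the import format is to prefix each value with its line number and a `#`. The sketch below assumes the source file holds one value per line; `number_lines` is a hypothetical helper name, not part of Hivemall.

```shell
# Hypothetical helper: prefix every input line with its 1-based line number
# and '#', producing the "num#value" layout used by the timeseries table.
# Reads the files given as arguments, or standard input when none are given.
number_lines() {
  awk '{ print NR "#" $0 }' "$@"
}

# Small demonstration with the first three values from the source data.
printf '182.478\n176.231\n183.917\n' | number_lines
# prints:
# 1#182.478
# 2#176.231
# 3#183.917
```

On a real file the output would be redirected to the file that gets uploaded, for example `number_lines raw_values.txt > twitter.t`, where `raw_values.txt` is an assumed name for the source file.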

Upload the data to HDFS

hadoop fs -put twitter.t /dataset/twitter/timeseries

Using sst

select
  num,
  sst(value, "-threshold 0.005") as result
from
  timeseries
order by num asc;

Sample result:

7551 7552 7553 7554 7555 7556 7557 7558 7559 7560

Outlier and change-point detection using changefinder

select
  num,
  changefinder(value, "-outlier_threshold 0.03 -changepoint_threshold 0.0035") as result
from
  timeseries
order by num asc;

Sample result:

16 17 18 19 20 21

date        platform  visits  seq   outlier_score  changepoint_score  is_outlier  is_changepoint
2016/9/13   weixin    163770  82    0.517939405    0.002799403        true        false
2016/8/13   weixin    163770  206   0.971553367    0.002563691        true        false
2016/7/13   weixin    163770  329   0.978151518    0.002569553        true        false
2016/6/12   weixin    163770  453   0.893766225    0.005063597        true        true
2016/5/12   weixin    163770  579   0.846256467    0.183253975        true        true
2016/4/11   weixin    163770  704   1.125480999    2.552122451        true        true
2016/3/11   weixin    163770  828   0.799214319    1.437753369        true        true
2016/2/9    weixin    163770  949   0.768004277    2.03710669         true        true
2016/1/9    weixin    163770  1075  1.019095702    3.674865513        true        true
2015/12/9   weixin    163770  1197  0.726433576    1.354028448        true        true
2015/11/8   weixin    163770  1321  0.740644184    2.317692273        true        true
2015/10/8   weixin    163770  1447  0.961904857    3.609401953        true        true
2015/9/7    weixin    163770  1572  0.917379836    2.44891755         true        true

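Filter for rows where is_outlier = true.

Selecting the weixin platform shows that some of the 2016 and 2015 records have identical platform visit counts.

Inspect the raw SF Express (順豐) data for 2016 and 2015.

The data for 2016/01, 2016/03, 2016/05, 2016/07, 2016/08, 2015/10, and 2015/12 are all the same size.

Preliminary conclusion: the data is duplicated.

Comparing the sizes of the sub-files under each of these dates largely confirms that they are all copies of the same data.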
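The size comparison can be sketched as a small filter over a file listing. This is only an illustration. It assumes a listing whose first field is a size in bytes and whose last field is a path, which is the shape of `hadoop fs -du` output. `same_size_groups` is a hypothetical helper name, and the sizes and paths in the demonstration are made up.

```shell
# Hypothetical helper: read listing lines of the form "<size> ... <path>"
# and print every size that is shared by more than one path, followed by
# those paths. Paths containing spaces are not handled.
same_size_groups() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      if ($1 in paths) { paths[$1] = paths[$1] " " $NF }
      else             { paths[$1] = $NF }
      count[$1]++
    }
    END {
      for (size in count) {
        if (count[size] > 1) { print size ": " paths[size] }
      }
    }
  ' "$@"
}

# Demonstration with made-up sizes and paths.
printf '1024 /demo/a\n2048 /demo/b\n1024 /demo/c\n' | same_size_groups
# prints:
# 1024: /demo/a /demo/c
```

Against HDFS the listing would come from something like `hadoop fs -du <directory> | same_size_groups`. Equal sizes are only a hint, so comparing checksums of the matching files would be a stronger confirmation.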

Hivemall official manual:
