Apache Flume: Regex Filtering

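In today's big data world, applications generate enormous volumes of electronic data, and these huge repositories hold valuable information. It is hard for a human analyst or domain expert to make interesting discoveries or spot patterns that could support decision making. We need automated processes to make effective use of this large, information-rich data for planning and investment decisions. Before the data can be processed, it must be collected, aggregated, and transformed, and finally moved into the repositories used by the various analytics and data mining tools.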

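One popular tool for performing all of these steps is Apache Flume. The data is usually stored in the form of events or logs. Apache Flume has three main components: sources, channels, and sinks.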

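Flume is highly configurable and supports many sources, channels, serializers, and sinks. It also supports streaming data. A powerful feature of Flume is the interceptor, which can modify or drop events in flight. One of the supported interceptors is regex_filter.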

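regex_filter interprets the event body as text and matches it against a configured regular expression. Depending on whether the pattern matches and on how the interceptor is configured, the event is either included or excluded. Let's look at regex_filter in detail.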
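The include/exclude behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Flume's code: the function name `regex_filter` and its parameters are made up for the example, and Python's `re` module stands in for the Java regex engine that Flume uses, so unusual patterns may behave differently.

```python
import re

def regex_filter(body: bytes, pattern: str, exclude_events: bool = False) -> bool:
    """Return True if the event would be kept, False if it would be dropped."""
    # The event body is treated as text.
    text = body.decode("utf-8", errors="replace")
    # The pattern may match anywhere in the body (a search, not a full match).
    matched = re.search(pattern, text) is not None
    # Include mode keeps matching events; exclude mode drops them.
    return (not matched) if exclude_events else matched

print(regex_filter(b"1,alok,mumbai,manager", "developer", exclude_events=True))       # True
print(regex_filter(b"3,yogesh,kolkata,developer", "developer", exclude_events=True))  # False
```

With `exclude_events=False` the same pattern would keep only the developer record instead.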

Requirements

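The data source delivers records made up of a street number, a name, a city, and a role. The source could be a real-time stream or any other source. In this example I use the netcat source, which listens on a given port and turns each line of text into an event. The requirement is to store the data in HDFS in text format. Before it is stored, the data must be filtered by role: only manager records should be stored in HDFS, and records for other roles must be ignored. For example, the following records are allowed: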

1,alok,mumbai,manager

2,jatin,chennai,manager

The following records are not allowed:

3,yogesh,kolkata,developer

5,jyotsana,pune,developer

How to meet this requirement

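This can be done with the regex_filter interceptor. The interceptor filters events according to a rule, so only the events of interest are passed on to the sink and all other events are ignored.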

# Describe the regex_filter interceptor and configure the excludeEvents attribute
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = developer
a1.sources.r1.interceptors.i1.excludeEvents = true
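One thing to keep in mind with a bare pattern such as developer is that it can match anywhere in the line, not just in the role field. The sketch below compares it with a pattern anchored to the last field. The anchored pattern and the second record are hypothetical and do not come from the original configuration, and Python's `re` is used here as a stand-in for Java's regex engine.

```python
import re

UNANCHORED = re.compile(r"developer")
# Hypothetical stricter pattern: match "developer" only as the last field.
# \s* tolerates a trailing carriage return such as telnet may send.
ANCHORED = re.compile(r",developer\s*$")

records = [
    "3,yogesh,kolkata,developer",
    "7,developerson,pune,manager",  # hypothetical record, not from the article
]

def would_drop(pattern, line):
    """In exclude mode, an event is dropped when the pattern matches."""
    return pattern.search(line) is not None

for line in records:
    print(line, would_drop(UNANCHORED, line), would_drop(ANCHORED, line))
```

The bare pattern drops both records, while the anchored one drops only the real developer record. If roles other than developer can occur, an include-style rule (a pattern for the manager role with excludeEvents = false) would express the requirement more directly.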

The HDFS sink stores data in HDFS in text or sequence-file format. It can also write the data in compressed form.

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Assumption: the Hadoop distribution is CDH
a1.sinks.k1.hdfs.path = hdfs:/hive/warehouse/managers
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

How to run the example

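First, you need Hadoop for the example to run with HDFS as the sink. If you do not have a Hadoop cluster, you can change the sink to a logger sink (a1.sinks.k1.type = logger) and then simply start Flume. Save the file regex_filter_flume_conf.conf in a directory and run the agent with the following command.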

flume-ng agent --conf conf --conf-file regex_filter_flume_conf.conf --name a1 -Dflume.root.logger=INFO,console

Note that the agent name is a1. I used netcat as the source.

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

Once the Flume agent has started, run the following command to send events to Flume.

telnet localhost 44444
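If telnet is not available, the same events can be sent from a short script. The following is a minimal sketch, assuming the agent is listening on localhost port 44444 as set by a1.sources.r1.port. The helper names `encode_events` and `send_events` are made up for this example.

```python
import socket

def encode_events(records):
    """Join records into newline-terminated bytes, one line per event."""
    return "".join(f"{record}\n" for record in records).encode("utf-8")

def send_events(records, host="localhost", port=44444, timeout=5.0):
    """Send the records to a listening netcat source (needs a running agent)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_events(records))

# Show the bytes that would go over the wire; no connection is made here.
print(encode_events(["1,alok,mumbai,manager", "3,yogesh,kolkata,developer"]))
```

With the agent running, calling `send_events(["1,alok,mumbai,manager"])` would deliver one event. The sketch does not read back any acknowledgement the source may send.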

Now we just need to type in the following input:

1,alok,mumbai,manager

2,jatin,chennai,manager

3,yogesh,kolkata,developer

4,ragini,delhi,manager

5,jyotsana,pune,developer

6,valmiki,banglore,manager

Browse HDFS and you will see that a file has been created under hdfs:/hive/warehouse/managers, and that it contains only the manager records.
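To check the result against the six input lines, it helps to work out which lines should survive the filter. The sketch below does that and compares the expectation with lines read back from the sink. The helper names are hypothetical, `sample_output` is a stand-in for the real file contents, and Python's `re` approximates the interceptor's matching.

```python
import re

INPUT_LINES = [
    "1,alok,mumbai,manager",
    "2,jatin,chennai,manager",
    "3,yogesh,kolkata,developer",
    "4,ragini,delhi,manager",
    "5,jyotsana,pune,developer",
    "6,valmiki,banglore,manager",
]

def expected_output(lines, pattern="developer"):
    """Lines that should survive an exclude-style filter on `pattern`."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]

def check_sink_output(output_lines, input_lines):
    """Return (missing, unexpected) lines relative to the expected output."""
    expected = expected_output(input_lines)
    missing = [line for line in expected if line not in output_lines]
    unexpected = [line for line in output_lines if line not in expected]
    return missing, unexpected

# Stand-in for the lines read back from the sink directory.
sample_output = [
    "1,alok,mumbai,manager",
    "2,jatin,chennai,manager",
    "4,ragini,delhi,manager",
    "6,valmiki,banglore,manager",
]
print(check_sink_output(sample_output, INPUT_LINES))  # ([], [])
```

Two empty lists mean that every manager record arrived and no developer record slipped through.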

The complete Flume configuration, regex_filter_flume_conf.conf, is as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source - netcat
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the HDFS sink
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:/hive/warehouse/managers
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Describe the regex_filter interceptor and configure the excludeEvents attribute
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = developer
a1.sources.r1.interceptors.i1.excludeEvents = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
