Practicing MapReduce with an inverted index

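An inverted index is a data structure commonly used in document retrieval systems, and it is widely used in full-text search engines.

Typically, an entry in an inverted index consists of a word or phrase together with a list of related documents. Each item in that list is either an ID that identifies the document or a URI that gives the document's location.

In practice, each document usually also needs a weight that indicates how relevant it is to the search query.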
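As a minimal sketch of the structure just described, the plain-Java class below keeps, for each word, a map from document ID to a weight. The class name `WeightedIndexSketch`, its methods, the sample documents, and the choice of raw term frequency as the weight are all assumptions made for illustration; a real search engine would normally use a more refined relevance score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: word -> (document ID -> weight).
final class WeightedIndexSketch {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Index one document; the weight is simply how often the word occurs in it.
    void add(String docId, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty document yields one empty token
            }
            postings.computeIfAbsent(word, w -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Weight of a word in a document, or 0 if the word does not occur there.
    int weight(String word, String docId) {
        return postings.getOrDefault(word, new HashMap<>()).getOrDefault(docId, 0);
    }

    // IDs of the documents containing the word, highest weight first,
    // ties broken by document ID.
    List<String> lookup(String word) {
        Map<String, Integer> docs = postings.getOrDefault(word, new HashMap<>());
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort(Comparator.comparing((String id) -> docs.get(id)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return ids;
    }

    public static void main(String[] args) {
        WeightedIndexSketch index = new WeightedIndexSketch();
        index.add("doc1", "mapreduce is fast");
        index.add("doc2", "mapreduce is powerful and is scalable");
        System.out.println(index.lookup("is")); // prints [doc2, doc1]
    }
}
```

Here `doc2` ranks first for `is` because the word occurs twice in it and only once in `doc1`.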

In my example, the input files contain the following:

hadoop11:/home/in/win1 # hadoop fs -cat /user/root/in1/words.txt

mapreduce is ******

hadoop11:/home/in/win1 # hadoop fs -cat /user/root/in1/words1.txt

mapreduce is powerfull and is ******

hadoop11:/home/in/win1 # hadoop fs -cat /user/root/in1/words2.txt

cat: file does not exist: /user/root/in1/words2.txt

The output I am aiming for:

and words1.txt:1;

bye words3.txt:1;

hello words3.txt:1;

is words.txt:1;words1.txt:2;

mapreduce words1.txt:1;words3.txt:2;words.txt:1;

powerfull words1.txt:1;

****** words1.txt:1;words.txt:1;

Code listing (add the imports yourself):

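public class IndexSum {

    // The mapper class is missing from the original listing. Judging from the
    // combiner below, it has to emit ("word:filename", "1") for every word it reads.

    // Combiner: sums the counts for one "word:filename" key, then re-keys the
    // record as word -> "filename:count".
    public static class IntSumReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // The signature and the summing loop are reconstructed; only the four
        // statements after the loop survive in the original listing.
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text val : values) {
                sum += Integer.parseInt(val.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            result.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, result);
        }
    }

    // Reducer: joins every "filename:count" entry of a word into one line.
    public static class IntSumReducer3 extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // The signature and the concatenation loop are reconstructed as well.
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder valueArray = new StringBuilder();
            for (Text val : values) {
                valueArray.append(val.toString()).append(";");
            }
            result.set(valueArray.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // The original hard-codes the input and output paths here; the values
        // are missing from the listing.
        String[] argsTemp = { /* input path, output path */ };
        args = argsTemp;
        File jarFile = EJob.createTempJar("bin");
        Configuration conf = new Configuration();
        conf.set("hadoop.job.ugi", "root,root");
        conf.set("fs.default.name", "hdfs://hadoop11:8020/");
        conf.set("mapred.job.tracker", "hadoop11:8021");
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            // The body is missing from the original listing; exit on bad arguments.
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        // The original names LogCount.class here, apparently left over from another job.
        job.setJarByClass(IndexSum.class);
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
        // The calls that register the mapper and the combiner are not shown in the original.
        job.setReducerClass(IntSumReducer3.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Log before exiting; in the original the log call came after System.exit
        // and never ran. The declaration of log is not shown in the listing.
        boolean succeeded = job.waitForCompletion(true);
        log.info("***************end at : " + new Date());
        System.exit(succeeded ? 0 : 1);
    }
}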

This is based on the example on the official Apache site.
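To make the data flow of the job easier to follow, here is a sketch in plain Java that imitates its stages in memory, without Hadoop. The class `IndexPipelineSketch`, its method names, and the sample documents are hypothetical. Only the key and value formats are taken from the listing above: a `word:filename` key with a count, re-keyed to the word with a `filename:count` value, and finally joined with `;`. It is an illustration of the idea, not the author's job code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map, combine and reduce stages.
final class IndexPipelineSketch {

    // Map stage plus summing: count every "word:filename" key.
    static Map<String, Integer> mapAndCount(Map<String, String> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word + ":" + file.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Re-key each record from "word:filename" to the word alone, then join all
    // "filename:count;" postings of the same word.
    // Like the listing, this splits at the first ':' and so assumes that
    // words themselves contain no colon.
    static Map<String, String> rekeyAndJoin(Map<String, Integer> counts) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int splitIndex = e.getKey().indexOf(":");
            String word = e.getKey().substring(0, splitIndex);
            String posting = e.getKey().substring(splitIndex + 1) + ":" + e.getValue() + ";";
            index.merge(word, posting, String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello mapreduce");
        files.put("b.txt", "mapreduce is mapreduce");
        Map<String, String> index = rekeyAndJoin(mapAndCount(files));
        for (Map.Entry<String, String> e : index.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With the two sample files, the entry for `mapreduce` comes out as `a.txt:1;b.txt:2;`. One design point worth keeping in mind: Hadoop does not guarantee that a combiner runs exactly once, so a job that changes the key format inside the combiner depends on behaviour the framework does not promise. Doing the re-keying in the mapper or the reducer avoids that dependency.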
