Reading on, we come to the second job.
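....................
// The partitioning is done mainly by URLPartitioner. Which key it partitions by is set through a configuration parameter; the available modes include PARTITION_MODE_DOMAIN and PARTITION_MODE_IP.
....................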
....................
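To make the partitioning idea concrete, here is a minimal sketch in plain Java, without any Hadoop or Nutch dependency. It is not the URLPartitioner source: the class name PartitionSketch, the Mode enum, and the naiveDomain helper are hypothetical, the "domain" is crudely approximated by the last two host labels, the IP mode is left out because it needs a DNS lookup, and the way the seed is mixed in is an assumption based on the "partition.url.seed" setting that appears in partitionSegment.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PartitionSketch {
    // Hypothetical stand-ins for the host and domain partition modes.
    enum Mode { HOST, DOMAIN }

    // Crude approximation of a domain name: keep only the last two host labels.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Maps a URL to a partition index in [0, numPartitions).
    static int partition(String url, Mode mode, int seed, int numPartitions) {
        int hash = url.hashCode(); // fallback when the URL cannot be parsed
        try {
            String host = new URI(url).getHost();
            if (host != null) {
                String key = (mode == Mode.DOMAIN) ? naiveDomain(host) : host;
                hash = key.hashCode();
            }
        } catch (URISyntaxException e) {
            // keep the hash of the whole URL string
        }
        hash ^= seed; // assumption: the seed shuffles keys between runs
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int seed = 42;
        System.out.println(partition("http://a.example.com/x", Mode.HOST, seed, 4));
        System.out.println(partition("http://a.example.com/y", Mode.HOST, seed, 4));
        System.out.println(partition("http://b.example.com/", Mode.DOMAIN, seed, 4));
    }
}
```

The point of hashing the host or domain rather than the full URL is that all URLs sharing that key land in the same partition, so a single fetch list owns a given site.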
// read the subdirectories generated in the temp
// output and turn them into segments
List<Path> generatedSegments = new ArrayList<Path>();
// Read the segments for the multiple fetchlists generated by the previous job
FileStatus[] status = fs.listStatus(tempDir);
try {
  ...
} catch (Exception e) {
  ...
}
if (generatedSegments.size() == 0) {
  ...
}
....................
....................
....................
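The directory scan above can be pictured with a small stand-alone sketch that uses java.nio.file instead of the Hadoop FileSystem API. The class name FetchlistScanSketch and the "fetchlist-" name prefix are assumptions made for illustration; the snippet in this post does not show how the subdirectories are named or filtered.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class FetchlistScanSketch {
    // Returns the sorted names of the subdirectories of tempDir that look like fetch lists.
    // The "fetchlist-" prefix is an assumption, not taken from the post.
    static List<String> fetchlistDirs(Path tempDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (Stream<Path> children = Files.list(tempDir)) {
            children.filter(Files::isDirectory)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.startsWith("fetchlist-"))
                    .sorted()
                    .forEach(names::add);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = Files.createTempDirectory("generate-temp");
        Files.createDirectory(tempDir.resolve("fetchlist-1"));
        Files.createDirectory(tempDir.resolve("fetchlist-2"));
        Files.createDirectory(tempDir.resolve("_logs")); // skipped by the prefix filter
        System.out.println(fetchlistDirs(tempDir));
    }
}
```

Each directory returned this way would then be handed to a partitioning step that turns it into one segment; if the list comes back empty there is nothing to fetch, which is the case the size() == 0 check guards against.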
// The partitioning key is set by configuration; by default it partitions by the URL's hash code.
private Path partitionSegment(FileSystem fs, Path segmentsDir, Path inputDir, int numLists) throws IOException {
  // Create a new directory named after the current time
  Path segment = new Path(segmentsDir, generateSegmentName());
  // Under that directory, create the crawl_generate subdirectory
  Path output = new Path(segment, CrawlDatum.GENERATE_DIR_NAME);
  LOG.info("Generator: segment: " + segment);
  NutchJob job = new NutchJob(getConf());
  job.setJobName("generate: partition " + segment);
  job.setInt("partition.url.seed", new Random().nextInt());
  FileInputFormat.addInputPath(job, inputDir);
  job.setInputFormat(SequenceFileInputFormat.class);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(SelectorEntry.class);
  job.setPartitionerClass(URLPartitioner.class);
  job.setReducerClass(PartitionReducer.class);
  job.setNumReduceTasks(numLists);
  FileOutputFormat.setOutputPath(job, output);
  job.setOutputFormat(SequenceFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setOutputKeyComparatorClass(HashComparator.class);
  JobClient.runJob(job);
  return segment;
}
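The directory layout produced by partitionSegment can be illustrated with a short sketch. It only mimics the naming: the class SegmentNameSketch and both helpers are hypothetical, and the yyyyMMddHHmmss timestamp pattern is an assumption about what generateSegmentName() produces, since its body is not shown in this post.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class SegmentNameSketch {
    // Formats a timestamp as a segment directory name (assumed pattern: yyyyMMddHHmmss).
    static String segmentName(long epochMillis, TimeZone zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setTimeZone(zone);
        return fmt.format(new Date(epochMillis));
    }

    // Builds the path of the crawl_generate directory inside a segment.
    static String generateDir(String segmentsDir, String segmentName) {
        return segmentsDir + "/" + segmentName + "/crawl_generate";
    }

    public static void main(String[] args) {
        String name = segmentName(System.currentTimeMillis(), TimeZone.getDefault());
        // Prints something like crawl/segments/<timestamp>/crawl_generate
        System.out.println(generateDir("crawl/segments", name));
    }
}
```

Because the name is derived from the clock, two segments created within the same second would collide under this scheme, so a real implementation has to guard against that in some way.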