Nutch Source Code Reading 1: Crawl

org.apache.nutch.crawl.Crawl implements a complete crawl from start to finish, so it is the natural place to start reading.

/* Perform complete crawling and indexing (to Solr) given a set of root urls and the -solr
   parameter respectively. More information and usage parameters can be found below. */
public static void main(String[] args) throws Exception

org.apache.nutch.util.NutchConfiguration

/**
 * Add the standard Nutch resources to {@link Configuration}.
 *
 * @param conf               Configuration object to which
 *                           configuration is to be added.
 */
private static Configuration addNutchResources(Configuration conf)

During initialization, nutch-default.xml and nutch-site.xml are loaded.
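The loading order matters because of how layered configuration behaves: in Hadoop's Configuration, a resource added later overrides the same key from an earlier one (unless the earlier value is marked final), so nutch-site.xml can override nutch-default.xml. The snippet below is a minimal plain-Java sketch of that layering. LayeredConfigSketch and the sample values are hypothetical stand-ins, not Hadoop's or Nutch's classes; only the key fetcher.threads.fetch and its fallback of 10 come from the listing in this post.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-in for layered configuration; NOT Hadoop's Configuration class.
class LayeredConfigSketch {
    private final Map<String, String> values = new LinkedHashMap<>();

    // A resource added later overrides keys set by earlier resources.
    void addResource(Map<String, String> resource) {
        values.putAll(resource);
    }

    // Returns the configured integer, or the fallback when the key is absent.
    int getInt(String key, int fallback) {
        String raw = values.get(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        LayeredConfigSketch conf = new LayeredConfigSketch();
        // Hypothetical contents standing in for nutch-default.xml
        conf.addResource(Collections.singletonMap("fetcher.threads.fetch", "10"));
        // Hypothetical contents standing in for nutch-site.xml
        conf.addResource(Collections.singletonMap("fetcher.threads.fetch", "25"));
        // The site-level value wins because it was added last.
        System.out.println(conf.getInt("fetcher.threads.fetch", 10));
    }
}
```

This is also why local changes are normally placed in nutch-site.xml rather than in nutch-default.xml.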

@Override
public int run(String[] args) throws Exception {
  Path rootUrlDir = null;
  Path dir = new Path("crawl-" + getDate());
  int threads = getConf().getInt("fetcher.threads.fetch", 10);
  int depth = 5;
  long topN = Long.MAX_VALUE;
  String solrUrl = null;
  // read the input arguments (the branch bodies are elided in this listing)
  for (int i = 0; i < args.length; i++) {
    if (...) { ... }  // the first condition is missing from this listing
    else if ("-threads".equals(args[i])) { ... }
    else if ("-depth".equals(args[i])) { ... }
    else if ("-topN".equals(args[i])) { ... }
    else if ("-solr".equals(args[i])) { ... }
    else if (args[i] != null) { ... }
  }
  JobConf job = new NutchJob(getConf());
  if (solrUrl == null) { ... }
  FileSystem fs = FileSystem.get(job);
  if (LOG.isInfoEnabled()) { ... }
  // create the directories that hold the crawl data, one per stage
  Path crawlDb = new Path(dir + "/crawldb");
  Path linkDb = new Path(dir + "/linkdb");
  Path segments = new Path(dir + "/segments");
  Path indexes = new Path(dir + "/indexes");
  Path index = new Path(dir + "/index");
  // initialize the temporary directory and the tools
  Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
  Injector injector = new Injector(getConf());
  Generator generator = new Generator(getConf());
  Fetcher fetcher = new Fetcher(getConf());
  ParseSegment parseSegment = new ParseSegment(getConf());
  CrawlDb crawlDbTool = new CrawlDb(getConf());
  LinkDb linkDbTool = new LinkDb(getConf());
  // initialize crawlDb
  injector.inject(crawlDb, rootUrlDir);
  int i;
  for (i = 0; i < depth; i++) {
    ...  // the statement that assigns segs is missing from this listing
    fetcher.fetch(segs[0], threads);  // fetch it
    if (!Fetcher.isParsing(job)) { ... }
    crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
  }
  if (i > 0) {
    ...
  } else {
    ...
  }
  if (LOG.isInfoEnabled()) { ... }
  return 0;
}
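The overall shape of run() is inject once, then repeat generate, fetch, parse (only when the fetcher did not parse), and update, up to depth rounds. The snippet below is a control-flow sketch of that loop only. The Stages interface, the trace helper, and the early exit when generate finds nothing to fetch are assumptions made for illustration; the generate step and the branch bodies are not visible in the listing above, and none of these names are Nutch APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Control-flow sketch only: the stages are hypothetical stubs, not the Nutch tools.
class CrawlLoopSketch {
    // Hypothetical stand-in for the inject/generate/fetch/parse/update tools.
    interface Stages {
        void inject();
        boolean generate(int round);   // assumption: false means nothing left to fetch
        void fetch(int round);
        void parse(int round);
        void updateCrawlDb(int round);
    }

    // Runs up to `depth` rounds and returns how many rounds completed.
    static int crawl(Stages stages, int depth, boolean fetcherParses) {
        stages.inject();                         // seed the crawl database once
        int i;
        for (i = 0; i < depth; i++) {
            if (!stages.generate(i)) {
                break;                           // assumed early exit
            }
            stages.fetch(i);
            if (!fetcherParses) {
                stages.parse(i);                 // separate parse step only if needed
            }
            stages.updateCrawlDb(i);
        }
        return i;
    }

    // Records the order of stage calls; URLs run out after `roundsWithUrls` rounds.
    static List<String> trace(int depth, int roundsWithUrls, boolean fetcherParses) {
        List<String> log = new ArrayList<>();
        crawl(new Stages() {
            public void inject() { log.add("inject"); }
            public boolean generate(int r) { log.add("generate " + r); return r < roundsWithUrls; }
            public void fetch(int r) { log.add("fetch " + r); }
            public void parse(int r) { log.add("parse " + r); }
            public void updateCrawlDb(int r) { log.add("update " + r); }
        }, depth, fetcherParses);
        return log;
    }

    public static void main(String[] args) {
        // depth 5, but only two rounds have URLs to fetch
        for (String step : trace(5, 2, false)) {
            System.out.println(step);
        }
    }
}
```

The round counter doubles as a signal after the loop: run() checks i > 0 to tell whether any round ran at all before moving on to the later stages.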
