Nutch Source Code Reading 1: Crawl

2021-09-01 17:43:48

org.apache.nutch.crawl.Crawl implements a complete crawl from start to finish, so it is a natural place to begin.

/* Perform complete crawling and indexing (to Solr) given a set of root urls and the -solr
   parameter respectively. More information and usage parameters can be found below. */
public static void main(String[] args) throws Exception { ... }

org.apache.nutch.util.NutchConfiguration

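/**
 * Add the standard Nutch resources to {@link Configuration}.
 *
 * @param conf Configuration object to which
 *             configuration is to be added.
 */
private static Configuration addNutchResources(Configuration conf) {
  conf.addResource("nutch-default.xml");
  conf.addResource("nutch-site.xml");
  return conf;
}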

During initialization, nutch-default.xml and nutch-site.xml are loaded, in that order.
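The loading order matters because a resource added later overrides values from one added earlier, which is how nutch-site.xml customizes the defaults. Here is a minimal sketch of that layering using plain java.util.Properties as a stand-in for the Hadoop Configuration class. The class name LayeredConfigSketch, its addResource helper, and the property values are illustrative assumptions, not Nutch code.

```java
import java.util.Properties;

/** Illustrative stand-in for layered configuration; not Nutch or Hadoop code. */
class LayeredConfigSketch {

    /** Copies every entry of resource into conf, overwriting keys that already exist. */
    static Properties addResource(Properties conf, Properties resource) {
        conf.putAll(resource);
        return conf;
    }

    public static void main(String[] args) {
        // Plays the role of nutch-default.xml (values are illustrative).
        Properties defaults = new Properties();
        defaults.setProperty("fetcher.threads.fetch", "10");
        defaults.setProperty("http.agent.name", "");

        // Plays the role of nutch-site.xml (hypothetical override).
        Properties site = new Properties();
        site.setProperty("http.agent.name", "my-crawler");

        Properties conf = new Properties();
        addResource(conf, defaults); // loaded first
        addResource(conf, site);     // loaded second, so its values win

        System.out.println(conf.getProperty("fetcher.threads.fetch"));
        System.out.println(conf.getProperty("http.agent.name"));
    }
}
```

Keys that appear only in the first resource keep their default value, while keys repeated in the second resource take the site-specific one.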

@Override
public int run(String[] args) throws Exception {

  Path rootUrlDir = null;
  Path dir = new Path("crawl-" + getDate());
  int threads = getConf().getInt("fetcher.threads.fetch", 10);
  int depth = 5;
  long topN = Long.MAX_VALUE;
  String solrUrl = null;

  // Read the input arguments
  for (int i = 0; i < args.length; i++) {
    if (...) { ... }
    else if ("-threads".equals(args[i])) { ... }
    else if ("-depth".equals(args[i])) { ... }
    else if ("-topN".equals(args[i])) { ... }
    else if ("-solr".equals(args[i])) { ... }
    else if (args[i] != null) { ... }
  }

  JobConf job = new NutchJob(getConf());
  if (solrUrl == null) { ... }
  FileSystem fs = FileSystem.get(job);
  if (LOG.isInfoEnabled()) { ... }

  // Directories that hold the data produced during the crawl, one per stage
  Path crawlDb = new Path(dir + "/crawldb");
  Path linkDb = new Path(dir + "/linkdb");
  Path segments = new Path(dir + "/segments");
  Path indexes = new Path(dir + "/indexes");
  Path index = new Path(dir + "/index");

  // Set up the temporary directory and the tool for each stage
  Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
  Injector injector = new Injector(getConf());
  Generator generator = new Generator(getConf());
  Fetcher fetcher = new Fetcher(getConf());
  ParseSegment parseSegment = new ParseSegment(getConf());
  CrawlDb crawlDbTool = new CrawlDb(getConf());
  LinkDb linkDbTool = new LinkDb(getConf());

  // initialize the crawlDb
  injector.inject(crawlDb, rootUrlDir);
  int i;
  for (i = 0; i < depth; i++) {
    ...
    fetcher.fetch(segs[0], threads); // fetch it
    if (!Fetcher.isParsing(job)) { ... }
    crawlDbTool.update(crawlDb, segs, true, true); // update the crawlDb
  }
  if (i > 0) {
    ...
  } else { ... }
  if (LOG.isInfoEnabled()) { ... }
  return 0;
}
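To make the control flow of run easier to follow, here is a small self-contained sketch of the same shape: parse the flags, inject the seeds, then repeat generate, fetch and update for at most depth rounds. It is a toy model and not Nutch code. The class CrawlLoopSketch, its Options holder, the in-memory link map, and the simplified default directory are all hypothetical, and the -dir flag is an assumption, since the first branch of the argument loop is missing from the listing above. Real Nutch stages run as Hadoop jobs over segments on disk, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the crawl driver. Every name here is hypothetical; this is not Nutch code. */
class CrawlLoopSketch {

    /** Options in the spirit of the variables declared at the top of run. */
    static class Options {
        String rootUrlDir;
        String dir = "crawl";            // simplified default; the listing appends a date
        int threads = 10;
        int depth = 5;
        long topN = Long.MAX_VALUE;
        String solrUrl;
    }

    /** Assumed convention: each flag is followed by its value; a bare argument is the seed directory. */
    static Options parse(String[] args) {
        Options o = new Options();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i])) {            // assumed flag name
                o.dir = args[++i];
            } else if ("-threads".equals(args[i])) {
                o.threads = Integer.parseInt(args[++i]);
            } else if ("-depth".equals(args[i])) {
                o.depth = Integer.parseInt(args[++i]);
            } else if ("-topN".equals(args[i])) {
                o.topN = Long.parseLong(args[++i]);
            } else if ("-solr".equals(args[i])) {
                o.solrUrl = args[++i];
            } else if (args[i] != null) {
                o.rootUrlDir = args[i];
            }
        }
        return o;
    }

    final Set<String> crawlDb = new HashSet<>();          // every URL seen so far
    final Deque<String> unfetched = new ArrayDeque<>();   // seen but not yet fetched

    /** Adds a URL once; new URLs are queued for a later round. */
    void addUrl(String url) {
        if (crawlDb.add(url)) unfetched.add(url);
    }

    /** Takes up to topN queued URLs as one "segment"; returns null when nothing is queued. */
    List<String> generate(long topN) {
        if (unfetched.isEmpty()) return null;
        List<String> segment = new ArrayList<>();
        while (!unfetched.isEmpty() && segment.size() < topN) {
            segment.add(unfetched.poll());
        }
        return segment;
    }

    /** Runs at most depth rounds over an in-memory link graph; returns the rounds completed. */
    int crawl(List<String> seeds, Map<String, List<String>> links, int depth, long topN) {
        for (String seed : seeds) addUrl(seed);                 // inject
        int i;
        for (i = 0; i < depth; i++) {
            List<String> segment = generate(topN);              // generate
            if (segment == null) break;                         // nothing left to fetch
            for (String url : segment) {                        // "fetch" and "parse"
                List<String> outlinks = links.getOrDefault(url, Collections.<String>emptyList());
                for (String out : outlinks) addUrl(out);        // update the crawl database
            }
        }
        return i;
    }

    public static void main(String[] args) {
        Options o = parse(new String[] {"urls", "-depth", "3", "-topN", "50"});
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        CrawlLoopSketch sketch = new CrawlLoopSketch();
        int rounds = sketch.crawl(Arrays.asList("a"), links, o.depth, o.topN);
        System.out.println(rounds + " rounds, " + sketch.crawlDb.size() + " URLs known");
    }
}
```

The loop ends either when depth rounds have run or when generate has nothing left to hand out, which mirrors the two exits of the for loop in run. The counter i is declared outside the loop for the same reason as in the listing: the code after the loop needs to know whether at least one round happened.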
