org.apache.nutch.crawl.Crawl implements a complete crawl cycle, so it is the natural place to start.
/* Perform complete crawling and indexing (to Solr) given a set of root urls and the -solr
   parameter respectively. More information and usage parameters can be found below. */
public static void main(String[] args) throws Exception
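{
  // Body not quoted above; a minimal sketch of the usual Nutch 1.x wiring is assumed here:
  // Crawl implements Hadoop's Tool, so main() only builds the configuration and dispatches.
  Configuration conf = NutchConfiguration.create();    // loads the Nutch resources, see below
  int res = ToolRunner.run(conf, new Crawl(), args);   // hands the arguments to Crawl.run()
  System.exit(res);
}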
The configuration comes from org.apache.nutch.util.NutchConfiguration:
/** On initialization this adds the standard Nutch resources
 *  (nutch-default.xml and nutch-site.xml) to the configuration.
 *  @param conf Configuration object to which configuration is to be added. */
private static Configuration addNutchResources(Configuration conf) {
  conf.addResource("nutch-default.xml");
  conf.addResource("nutch-site.xml");
  return conf;
}
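NutchConfiguration.create() is what main() calls to obtain this configuration; a minimal sketch, assuming it simply wraps addNutchResources() around a fresh Hadoop Configuration (real versions may add a little more, e.g. a UUID marker):

public static Configuration create() {
  Configuration conf = new Configuration();   // plain Hadoop configuration
  return addNutchResources(conf);             // overlay nutch-default.xml and nutch-site.xml
}

Back in Crawl, the actual crawl logic lives in run():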
@Override
public int run(String[] args) throws Exception {
  Path rootUrlDir = null;
  Path dir = new Path("crawl-" + getDate());
  int threads = getConf().getInt("fetcher.threads.fetch", 10);
  int depth = 5;
  long topN = Long.MAX_VALUE;
  String solrUrl = null;
  // parse the command-line arguments (an example invocation follows below)
  for (int i = 0; i < args.length; i++) {
    if ("-dir".equals(args[i]))          { dir = new Path(args[++i]); }
    else if ("-threads".equals(args[i])) { threads = Integer.parseInt(args[++i]); }
    else if ("-depth".equals(args[i]))   { depth = Integer.parseInt(args[++i]); }
    else if ("-topN".equals(args[i]))    { topN = Integer.parseInt(args[++i]); }
    else if ("-solr".equals(args[i]))    { solrUrl = args[++i]; }
    else if (args[i] != null)            { rootUrlDir = new Path(args[i]); }
  }
  JobConf job = new NutchJob(getConf());
  if (solrUrl == null) {
    LOG.warn("solrUrl is not set, indexing will be skipped...");
  }
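  // Example (hypothetical invocation, not from the original post):
  //   bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  // leaves rootUrlDir=urls, dir=crawl, depth=3, topN=50 and solrUrl=null,
  // so the Solr indexing step at the end of run() is skipped.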
  FileSystem fs = FileSystem.get(job);
  if (LOG.isInfoEnabled()) {
    LOG.info("crawl started in: " + dir);  // also logs rootUrlDir, threads, depth, solrUrl and topN
  }
  // directories that hold the crawl data, one per stage
  Path crawlDb = new Path(dir + "/crawldb");
  Path linkDb = new Path(dir + "/linkdb");
  Path segments = new Path(dir + "/segments");
  Path indexes = new Path(dir + "/indexes");
  Path index = new Path(dir + "/index");
  // instantiate the tools that drive each stage
  Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
  Injector injector = new Injector(getConf());
  Generator generator = new Generator(getConf());
  Fetcher fetcher = new Fetcher(getConf());
  ParseSegment parseSegment = new ParseSegment(getConf());
  CrawlDb crawlDbTool = new CrawlDb(getConf());
  LinkDb linkDbTool = new LinkDb(getConf());
  // initialize the crawldb: inject the root URLs
  injector.inject(crawlDb, rootUrlDir);
  int i;
  for (i = 0; i < depth; i++) {  // generate, fetch, parse and update, one round per depth level
    Path[] segs = generator.generate(crawlDb, segments, -1, topN, System.currentTimeMillis());
    if (segs == null) {
      LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
      break;
    }
    fetcher.fetch(segs[0], threads);                 // fetch it
    if (!Fetcher.isParsing(job)) {
      parseSegment.parse(segs[0]);                   // parse it, if needed
    }
    crawlDbTool.update(crawlDb, segs, true, true);   // update the crawldb
  }
  if (i > 0) {
    linkDbTool.invert(linkDb, segments, true, true, false);  // invert links
    // when solrUrl is set, the segments are indexed to Solr and deduplicated here (not shown in this excerpt)
  } else {
    LOG.warn("No URLs to fetch - check your seed list and URL filters.");
  }
  if (LOG.isInfoEnabled()) {
    LOG.info("crawl finished: " + dir);
  }
  return 0;
}
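After a successful run, the directory passed via -dir contains the crawldb, linkdb and segments subdirectories built above (indexes and index belong to the older local Lucene indexing flow and, depending on the Nutch version, may go unused when indexing is done by Solr). Each stage can also be driven on its own with the matching Nutch command-line tools (inject, generate, fetch, parse, updatedb, invertlinks), which wrap the same classes that run() instantiates, so the loop above is essentially a scripted version of the step-by-step workflow.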