Larbin Study Notes

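larbin is an open-source web crawler written in C++. It offers a number of customization options and a fairly high page-fetching speed.

The figure below shows the basic process a typical crawler follows when fetching web pages.

Crawling starts from the seed URLs given by starturl in /larbin.conf.

Let us first look at the class used to handle URLs:

The class diagram above shows only the visible interface of the url class. Apart from the basic constructors and the get functions for its private variables, the more important function in the url class is hashcode(), which is implemented as follows: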

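/* return a hashcode for this url */
uint url::hashcode () {
  unsigned int h=port;
  unsigned int i=0;
  // fold every character of the host name into the hash
  while (host[i] != 0) {
    h = 31*h + host[i];
    i++;
  }
  i=0;
  // then fold in every character of the file path
  while (file[i] != 0) {
    h = 31*h + file[i];
    i++;
  }
  return h % hashsize;
}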
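The hash can be tried out in isolation. The following is a minimal sketch, not larbin's actual code: the free function `urlHash` and its parameters are hypothetical stand-ins for the members of the url class, and the table size is passed in as an argument instead of being read from larbin's configuration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-alone version of the hash described above.
// It starts from the port, folds in each byte of host and then file
// with the multiplier 31, and finally reduces modulo the table size.
// Bytes are treated as unsigned here, which is an assumption of this
// sketch and only matters for non-ASCII input.
inline std::uint32_t urlHash(std::uint16_t port,
                             const std::string& host,
                             const std::string& file,
                             std::uint32_t hashSize) {
    assert(hashSize != 0);
    std::uint32_t h = port;
    for (unsigned char c : host) h = 31u * h + c;  // unsigned overflow wraps
    for (unsigned char c : file) h = 31u * h + c;
    return h % hashSize;
}
```

Since the port seeds the hash, the same host and path on two different ports generally produce different codes, and the result is always smaller than the table size.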

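In global.h, the global variable hashtable *seen records the URLs that have appeared during crawling. With larbin's default hashsize of 64,000,000, *seen is a hashtable structure backed by a char array of 8,000,000 bytes (64,000,000 bits / 8). Each bit set to 1 in that char array, *table, stands for one URL that has already been seen.

The function that sets the bits of the *table array is implemented as follows: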

/* add a new url in the hashtable
 * return true if it has been added
 * return false if it has already been seen */
bool hashtable::testset (url *u) {
  unsigned int code = u->hashcode();
  unsigned int pos = code / 8;          // index in table of the byte holding this url's hashcode
  unsigned int bits = 1 << (code % 8);  // mask for this hashcode's bit within that byte
  int res = table[pos] & bits;          // nonzero if the bit is already set to 1
  table[pos] |= bits;                   // mark the bit as 1
  return !res;
}
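The bit-table idea can be sketched on its own. The class below is a hypothetical illustration (the name `SeenBits` and its interface are not from larbin) that takes a precomputed hash code in place of a url object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit-table "seen" set: one bit per possible hash code.
class SeenBits {
public:
    // numCodes is the number of distinct hash codes (hashsize in the text);
    // the table needs one byte per 8 codes, rounded up.
    explicit SeenBits(std::size_t numCodes)
        : numCodes_(numCodes), table_((numCodes + 7) / 8, 0) {}

    // Marks `code` as seen. Returns true if it was new, false if its bit
    // was already set (by the same URL or by a colliding one).
    bool testSet(std::uint32_t code) {
        assert(code < numCodes_);
        const std::size_t pos = code / 8;
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << (code % 8));
        const bool wasSet = (table_[pos] & bit) != 0;
        table_[pos] |= bit;
        return !wasSet;
    }

    // Size of the underlying byte array.
    std::size_t bytes() const { return table_.size(); }

private:
    std::size_t numCodes_;
    std::vector<std::uint8_t> table_;
};
```

With 64,000,000 codes the table occupies 8,000,000 bytes. The trade-off of storing a single bit per code is that two different URLs with the same hash code cannot be told apart, so the second one is reported as already seen.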

Page content is deduplicated in a similar way, by hashdup::testset:

/* set a page in the hashtable
 * return false if it was already there
 * return true if it was not (ie it is new) */
bool hashdup::testset (char *doc) {
  unsigned int code = 0;
  char c;
  for (uint i=0; (c=doc[i])!=0; i++) {
    // the rest of this condition is cut off in the source text;
    // an upper bound of c<'z' is assumed here
    if (c>'a' && c<'z') {
      code = (code*23 + c) % size;
    }
  }
  unsigned int pos = code / 8;
  unsigned int bits = 1 << (code % 8);
  int res = table[pos] & bits;
  table[pos] |= bits;
  return !res;
}

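Clearly, this page-content deduplication is very simple: it can only recognize different links that point to pages whose content is exactly the same.

To analyze and process the fetched pages and robots files, larbin defines a file class, and the html and robots classes inherit from it. The class diagram below shows the relationships between these classes and their related functions.

The inputheaders() function of the html class processes the headers of a fetched page, including links, status and so on. The endinput function calls other private functions to carry out a series of operations that analyze the page content and extract and manage links. endinput first calls global::hduplicate->testset() to check whether the page is a duplicate, and then calls parsehtml() to analyze the page content. The important steps in parsehtml() are the call to parsetag(), which analyzes HTML tags, followed by the call to parsecontent(action), which handles the URLs found in the tag content. parsecontent(action) calls manageurl(), and manageurl() calls the check() function in /fetch/checker.cc to add the URL to the front-end queue.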
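The behaviour of such a fingerprint can be sketched as follows. This is an illustration under assumptions, not larbin's code: the function name `contentCode` is hypothetical, and the character range is assumed to be strictly between 'a' and 'z' because the exact condition cannot be recovered from the listing.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical content fingerprint in the style of the loop above.
// ASSUMPTION: only characters strictly between 'a' and 'z' are hashed.
inline std::uint32_t contentCode(const std::string& doc, std::uint32_t size) {
    assert(size != 0);
    std::uint64_t code = 0;  // 64-bit intermediate so code*23 cannot overflow
    for (char c : doc) {
        if (c > 'a' && c < 'z') {
            code = (code * 23 + static_cast<unsigned char>(c)) % size;
        }
    }
    return static_cast<std::uint32_t>(code);
}
```

Identical pages always get the same code. Because characters outside the hashed range are skipped, pages that differ only in those characters (digits, whitespace, punctuation) get the same code too, while any other change to the text is very likely to produce a different code.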

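The URL queues are kept in the following global structures:

persistentfifo *urlsdisk;      // front-end queue of the URL set; new URLs are added here
persistentfifo *urlsdiskwait;  // gives non-blocking access to a limited number of front-end URLs,
                               // i.e. when the queue is empty it returns null instead of blocking
namedsite *namedsitelist;      // large table of visited sites, indexed by URL hashcode;
                               // keeps each site's DNS attributes and its queue of URLs waiting to be fetched
fifo *oksites;                 // sites in this queue have a known host address and URLs waiting to be fetched
fifo *dnssites;                // sites in this queue have fetchable URLs but still need a DNS resolution call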

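namedsite *namedsitelist is the key data structure for maintaining the URL queues. With larbin's default settings it can maintain the pending queues of 20000 sites. The class diagram of the namedsite class is as follows:

As can be seen, namedsite maintains one FIFO queue per site. Depending on the DNS state of each site, it uses newquery() to start a new asynchronous DNS resolution through the adns library (the DNS module uses the open-source adns library). It also uses puturl() and puturlwait() to manage the addition of URLs to the back-end queue global::*oksites.

For more details on larbin's implementation and customization, see the comments in the source code.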
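The interplay between the per-site queues and the two site queues can be sketched as follows. This is a simplified model under stated assumptions, not larbin's implementation: `SiteTable`, `Site`, `resolveNext` and `nextUrl` are hypothetical names, DNS resolution is reduced to a state flag, and a slot that is busy with another host simply rejects the URL.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <vector>

enum class DnsState { Unknown, Resolved };

struct Site {
    std::string host;
    DnsState dns = DnsState::Unknown;
    std::deque<std::string> pending;  // per-site FIFO of URLs to fetch
};

class SiteTable {
public:
    explicit SiteTable(std::size_t n) : sites_(n) { assert(n > 0); }

    // Adds a URL to its site's FIFO. A site that becomes non-empty is put
    // on okSites (address known) or dnsSites (still needs resolving).
    bool putUrl(const std::string& host, const std::string& url) {
        Site& s = sites_[std::hash<std::string>{}(host) % sites_.size()];
        if (s.host != host) {
            if (!s.pending.empty()) return false;  // slot busy with another host
            s.host = host;
            s.dns = DnsState::Unknown;
        }
        const bool wasEmpty = s.pending.empty();
        s.pending.push_back(url);
        if (wasEmpty) {
            (s.dns == DnsState::Resolved ? okSites : dnsSites).push_back(&s);
        }
        return true;
    }

    // Stand-in for an asynchronous DNS answer: marks the next waiting
    // site as resolved and moves it to okSites.
    bool resolveNext() {
        if (dnsSites.empty()) return false;
        Site* s = dnsSites.front();
        dnsSites.pop_front();
        s->dns = DnsState::Resolved;
        okSites.push_back(s);
        return true;
    }

    // Non-blocking: returns false at once when no resolved site has URLs.
    bool nextUrl(std::string& out) {
        if (okSites.empty()) return false;
        Site* s = okSites.front();
        okSites.pop_front();
        out = s->pending.front();
        s->pending.pop_front();
        if (!s->pending.empty()) okSites.push_back(s);  // take turns between sites
        return true;
    }

    std::deque<Site*> okSites;   // sites ready to be fetched
    std::deque<Site*> dnsSites;  // sites waiting for DNS resolution

private:
    std::vector<Site> sites_;  // fixed size, so Site pointers stay valid
};
```

A site appears on at most one of the two queues at a time, and URLs of an unresolved site stay in its own FIFO until the address is known.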
