Python Deep URL Collection: Source Code Analysis

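There are already plenty of URL collection tools around, so reinventing the wheel may not look worthwhile.

I still insisted on writing this one, for three reasons: first, with my own code I don't have to worry about backdoors; second, writing tools improves my own skills; and third, the main one, the URL collectors on the market aren't strong enough and don't crawl deeply enough...

And that is starting from just a single URL. Even the URL collector I wrote earlier on my blog gathers a few hundred URLs in one run; feed those few hundred URLs into deep collection and, after deduplication, tens of thousands of URLs can remain.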

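For the side-site lookup (other sites hosted on the same IP), I tested a range of platforms, including 站長工具 (webmaster tools) and webscan, and in the end settled on webscan.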
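The lookup is a two-step exchange: one request maps a domain to its IP, and a second maps that IP to the domains hosted on it. Here is a minimal sketch of the parsing side only. It assumes response bodies shaped like the two hypothetical samples below; the real webscan response format is not shown in this post, and only the two regular expressions come from the script. The helper names extract_ip and extract_domains are made up for illustration.

```python
import re

# Hypothetical sample bodies; the real service's responses may differ.
SAMPLE_IP_BODY = '{"ip":"203.0.113.7","code":0}'
SAMPLE_DOMAIN_BODY = (
    '[{"domain":"http:\\/\\/a.example.com","title":"A"},'
    '{"domain":"http:\\/\\/b.example.com","title":"B"}]'
)


def extract_ip(body):
    """Return the first IP matched by the script's pattern, or None."""
    found = re.findall(r'\{"ip":"(.*?)",', body)
    return found[0] if found else None


def extract_domains(body):
    """Return the matched domains with JSON backslash escapes removed."""
    return [d.replace('\\', '') for d in re.findall(r'domain":"(.*?)",', body)]


if __name__ == '__main__':
    print(extract_ip(SAMPLE_IP_BODY))
    print(extract_domains(SAMPLE_DOMAIN_BODY))
```

Keeping the parsing in small functions like these makes it possible to check the patterns against a saved response without hitting the network.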

The source code is as follows:

# coding=utf-8
import re
import requests
import os
import time

# Print the startup banner
print('''
''')

# Request headers used by the lookups below.
# Assumption: the original listing uses `headers` without defining it; this value is a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}


# Look up the side sites of the original URLs and save them to urlx1.txt
def tongip():
    with open('urlx.txt', 'r') as f1:   # URLs gathered by my previous URL collector
        f2 = f1.readlines()
    f3 = open('urlx1.txt', 'w')
    for uuu in f2:
        urla = uuu.strip('\n')
        urlip = '' + urla    # build the lookup URL (base URL is empty in this listing); submitting it returns the IP
        try:
            r = requests.get(url=urlip, headers=headers, timeout=10)
            zhaohan1 = re.findall(r'\{"ip":"(.*?)",', r.text)  # regex out the IP
            print(zhaohan1)
            urlip2 = '' + ''.join(zhaohan1)
            print(urlip2)
            r1 = requests.get(url=urlip2, timeout=10)
            zhaohan2 = re.findall(r'domain":"(.*?)",', r1.text)  # regex out the URLs returned by the second request
            for zhaohan3 in zhaohan2:
                print(zhaohan3)
                zhaohan4 = zhaohan3.replace('\\', '')
                f3.write(zhaohan4 + '\n')
        except Exception:
            print('no found')
    f3.close()


tongip()


# Extract links from each side site's page and save them to urlx2.txt
def yqlj():
    with open('urlx1.txt', 'r') as f1:
        f2 = f1.readlines()
    f3 = open('urlx2.txt', 'w')
    for uuu in f2:
        urla = uuu.strip('\n')
        try:
            # Assumption: the original listing reads r.content here without fetching
            # anything in this function; requesting the site itself is the likely intent.
            r = requests.get(url=urla, headers=headers, timeout=10)
            # Tested against many URLs to find the variations; all patterns are kept
            # so the crawl is as complete as possible (pattern strings are empty in this listing)
            yq = re.findall('', r.text)
            yq1 = re.findall('', r.text)
            yq2 = re.findall('', r.text)
            yq3 = re.findall('', r.text)
            for zhaohan in yq:
                print(zhaohan)
                f3.write(zhaohan + '\n')
            for zhaohan1 in yq1:
                print(zhaohan1)
                f3.write(zhaohan1 + '.cn' + '\n')
            for zhaohan2 in yq2:
                print(zhaohan2)
                f3.write(zhaohan2 + '.com' + '\n')
            for zhaohan3 in yq3:
                print(zhaohan3)
                f3.write(zhaohan3 + '.net' + '\n')
        except Exception:
            print('no found')
    f3.close()


yqlj()


# Run the side-site lookup again on the URLs in urlx2.txt and append the results to urlx3.txt
def zcpz():
    with open('urlx2.txt', 'r') as f1:
        f2 = f1.readlines()
    f3 = open('urlx3.txt', 'a+')
    for uuu in f2:
        urla = uuu.strip('\n')
        try:
            urlip = '' + urla
            r = requests.get(url=urlip, headers=headers, timeout=10)
            zhaohan1 = re.findall(r'\{"ip":"(.*?)",', r.text)
            print(zhaohan1)
            urlip2 = '' + ''.join(zhaohan1)
            print(urlip2)
            r1 = requests.get(url=urlip2, timeout=10)
            zhaohan2 = re.findall(r'domain":"(.*?)",', r1.text)
            for zhaohan3 in zhaohan2:
                print(zhaohan3)
                zhaohan4 = zhaohan3.replace('\\', '')
                f3.write(zhaohan4 + '\n')
        except Exception:
            print('no found')
    f3.close()


zcpz()

os.remove('urlx1.txt')
os.remove('urlx2.txt')

# Remove duplicates and write the final list to result.txt
with open('urlx3.txt') as f4:
    unique_urls = set(x for x in f4.read().replace('\n', ' ').split(' ') if x)
with open('result.txt', 'a+') as f5:
    for neti in unique_urls:
        f5.write(neti + '\n')
os.remove('urlx3.txt')
print('*************************==over***********************************===')
time.sleep(10)  # only result.txt is kept; urlx1.txt, urlx2.txt and urlx3.txt are deleted, though you can keep them for inspection
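yqlj() (the name suggests 友情鏈接, "friendly links") pulls links out of each page with regular expressions. As an alternative sketch, and explicitly a swapped-in technique rather than the script's own method, the standard library's HTML parser can collect the hosts of all outbound links. The names LinkHostCollector and external_hosts are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkHostCollector(HTMLParser):
    """Collect the host part of every absolute <a href="..."> link."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                host = urlparse(value).netloc
                if host:  # relative links have no host and are skipped
                    self.hosts.append(host.lower())


def external_hosts(html, own_host):
    """Return hosts linked from html, excluding own_host, in first-seen order."""
    parser = LinkHostCollector()
    parser.feed(html)
    seen = set()
    result = []
    for host in parser.hosts:
        if host != own_host.lower() and host not in seen:
            seen.add(host)
            result.append(host)
    return result
```

Because this works on hosts rather than on matched fragments, there is no need to append suffixes such as .cn or .com afterwards, at the cost of also picking up links that are not friendly links.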

Running it against a single original URL as a test, it crawled 2988 URLs. I wouldn't dare try feeding it a few hundred at once...
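The final step of the script collapses duplicates with set(), which does not keep the URLs in the order they were found. Here is a small sketch of an order-preserving variant, assuming one URL per line; the function name dedupe_lines is hypothetical.

```python
def dedupe_lines(lines):
    """Return the non-empty, stripped lines with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

It accepts any iterable of strings, so an open file such as urlx3.txt can be passed in directly and the returned list written out to result.txt.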
