A scraping exercise, mainly using requests. Since the page's inline script has to be parsed, I went straight to `re` regular expressions; BeautifulSoup would normally work too, and is probably more convenient.
Start by reading the first detail page; to grab everything, we always begin from the URL ending in 1.htm.
Then recursively crawl every following page. One extra thing done here: an xcar_lst records the info of every page visited — it is kept purely for the record and is not used yet.
```python
# coding: utf-8
__author__ = 'bonfy chen'

import re
import requests

proxies = None
# NOTE: the original headers dict was lost in transcription;
# a minimal User-Agent is used here as a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}
base_folder = 'd:/***_folder/'


class XCarDown(object):

    _base_folder = None
    _proxies = None
    _headers = None
    _website = ''   # the domain prefix was elided in the original post
    _xcar_lst = []

    def __init__(self, base_folder=base_folder, proxies=proxies, headers=headers):
        self.set_base_folder(base_folder)
        self.set_headers(headers)
        self.set_proxies(proxies)

    def set_base_folder(self, base_folder):
        self._base_folder = base_folder

    def set_headers(self, headers):
        self._headers = headers

    def set_proxies(self, proxies):
        self._proxies = proxies

    def download_image_from_url(self, url, name=None):
        """Download one image.

        :param url: the source image url
        :param name: the destination file name prefix
        """
        local_filename = name + '_' + url.split('/')[-1]
        r = requests.get(url, proxies=self._proxies, headers=self._headers,
                         stream=True)
        with open(self._base_folder + local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
        return local_filename

    def download_xcar(self, url):
        """Download every picture of one model, page by page.

        :param url: a detail page url on xcar.com.cn,
                    e.g. /2674/2015/detail/1.htm
        """
        r = requests.get(url, proxies=self._proxies, headers=self._headers)
        # print(r.encoding)
        r.encoding = 'gbk'
        # NOTE: the HTML fragments inside the original regexes were lost
        # in transcription; the patterns below only restore the named
        # groups the rest of the code expects and are placeholders for
        # the real page markup.
        m1 = re.search(r"var nextUrl = '(?P<n_url>.*?\.htm)'", r.text)
        next_url = m1.groupdict()['n_url'] if m1 else None
        m2 = re.search(r"<img src=\"(?P<pic_url>[^\"]+)\"", r.text)
        pic_url = m2.groupdict()['pic_url'] if m2 else None
        m3 = re.search(r"<h1>(?P<title>.*?)</h1>", r.text)
        title = m3.groupdict()['title'] if m3 else ''
        m4 = re.search(r"<span>(?P<cont>.*?)</span>", r.text)
        cont = m4.groupdict()['cont'] if m4 else ''
        m5 = re.search(r"<title>(?P<model>.*?)</title>", r.text)
        model = m5.groupdict()['model'] if m5 else ''

        if pic_url:
            try:
                self.download_image_from_url(pic_url,
                                             name='_'.join([model, title, cont]))
                print('download complete: pic from {}'.format(pic_url))
            except IOError:
                # the joined name can be too long / invalid on Windows;
                # fall back to the model name alone
                print('file name IOError')
                self.download_image_from_url(pic_url, name=model)
                print('download complete: pic from {}'.format(pic_url))
            except Exception as e:
                print(e)

        # kept purely for the record, not used yet
        dct = dict(pic_url=pic_url, next_url=next_url, title=title,
                   cont=cont, model=model)
        self._xcar_lst.append(dct)

        if next_url and next_url.endswith('.htm'):
            self.download_xcar(self._website + next_url)


if __name__ == '__main__':
    print("welcome to the pic download for xcar.com")
    print("downloaded files in the folder: " + base_folder)
    print("---------------------------------------")
    id_modell = int(input("please enter the modell id (e.g. 2674): "))
    year = int(input("please enter the year (e.g. 2015): "))
    url = '/{}/{}/detail/1.htm'.format(id_modell, year)
    xcar = XCarDown()
    xcar.download_xcar(url)
```
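The whole crawl hinges on pulling the `nextUrl` variable out of each page's inline script with a `re` named group, then recursing until it no longer ends in `.htm`. A minimal self-contained illustration of that extraction (the sample string is made up for demonstration; the real content is whatever xcar.com.cn serves):

```python
import re

# Made-up fragment mimicking the inline script the crawler parses.
sample = "var nextUrl = '/2674/2015/detail/2.htm';"

m = re.search(r"var nextUrl = '(?P<n_url>.*?\.htm)'", sample)
next_url = m.groupdict()['n_url'] if m else None
print(next_url)  # -> /2674/2015/detail/2.htm
```

When the pattern is absent, `re.search` returns `None`, which is why every lookup in the script is guarded with `if m else None` before the recursion continues.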