A quick comparison of XPath and regular expressions
1. XPath expressions are more efficient.
2. Regular expressions are more powerful.
3. In general, reach for XPath first, and fall back to regular expressions only when XPath cannot do the job.
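To make the trade-off concrete, here is a minimal sketch (the sample HTML is invented for illustration, not taken from dangdang) extracting the same links both ways. The XPath query mirrors the document structure, while the regex has to pattern-match the raw markup:

import re
from lxml import etree

html = "<div class='tools'><a href='/book/1' title='Python'>Python</a></div>"

links_xpath = etree.HTML(html).xpath("//a/@href")          # ['/book/1']
links_regex = re.findall(r"<a[^>]*href='([^']*)'", html)   # ['/book/1']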
XPath extraction rules
1. / extracts level by level, e.g. /html/head/title/text()
2. text() extracts the text under a tag.
3. //tagname extracts every tag with the given name.
4. //tagname[@attribute='value'] extracts the tags whose attribute has the given value; @attribute selects an attribute.

# extract the content of the div tags whose class is 'tools'
//div[@class='tools']
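A short sketch exercising each of the rules above (the sample HTML is invented for illustration):

from lxml import etree

doc = etree.HTML(
    "<html><head><title>demo</title></head>"
    "<body><div class='tools'><a href='/a'>first</a></div>"
    "<div><a href='/b'>second</a></div></body></html>"
)

print(doc.xpath("/html/head/title/text()"))        # rules 1 + 2: ['demo']
print(doc.xpath("//a"))                            # rule 3: both <a> elements
print(doc.xpath("//div[@class='tools']//text()"))  # rule 4: ['first']
print(doc.xpath("//a/@href"))                      # @attribute: ['/a', '/b']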
Prepare the MySQL database (at the client's "enter password" prompt, the login password is initially root):

create database dangdang;        -- create the database
use dangdang;                    -- switch to this database
create table goods(
    id int(32) auto_increment primary key,
    title varchar(100),
    link varchar(100) unique,
    comment varchar(100)
);                               -- create the goods table to hold the scraped records
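As a quick sanity check that the table is reachable from Python before wiring up the pipeline (a sketch; it assumes the credentials above and that pymysql is installed):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="dangdang")
try:
    with conn.cursor() as cur:
        cur.execute("describe goods")   # lists the columns created above
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()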
Parts of the Scrapy project to change

1. items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# (see the Scrapy documentation on Items)
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # three fields to store the scraped data
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()
2. the spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from ddbook.items import DangdangItem   # assumption: the project package name; adjust to yours

class DdSpider(scrapy.Spider):   # the class header is not shown in the original; use what genspider created
    name = "dd"
    allowed_domains = ["dangdang.com"]
    start_urls = ['']   # start URL (left blank in the original)

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@dd_name='單品標題']/@title").extract()
        item["link"] = response.xpath("//a[@dd_name='單品標題']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        yield item
        # crawl the following pages (the original comment says "the first ten pages";
        # range(2, 10) covers pages 2 through 9)
        for i in range(2, 10):
            url = '' + str(i) + '-cid4008149.html'   # base URL (left blank in the original)
            yield Request(url, callback=self.parse)
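With the item and spider in place, the crawl is started from the project root (using the spider name assumed above):

scrapy crawl dd

Each call to extract() returns a list, so one yielded item carries parallel lists of titles, links, and review links for a whole listing page; the pipeline below walks those lists by index.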
3. pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
# (don't forget to add your pipeline to the ITEM_PIPELINES setting;
#  see the Scrapy documentation on item pipelines)
import pymysql

class DangdangPipeline(object):
    def process_item(self, item, spider):
        # connect to the database
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="dangdang")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            # build the SQL statement
            sql = "insert into goods(title,link,comment) values('" + title + "','" + link + "','" + comment + "')"
            # print(sql)
            try:
                conn.query(sql)
            except Exception as err:
                print(err)
        conn.commit()   # commit the inserts; pymysql does not autocommit by default
        conn.close()
        return item
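String-concatenated SQL breaks as soon as a title contains a quote, and it is open to SQL injection. A safer variant (a sketch, reusing the table and credentials above) opens one connection per crawl and uses parameterized queries; remember to register the pipeline in settings.py, e.g. ITEM_PIPELINES = {'ddbook.pipelines.DangdangPipeline': 300}, where the module path is an assumption about your project layout:

import pymysql

class DangdangPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl instead of one per item
        self.conn = pymysql.connect(host="127.0.0.1", user="root",
                                    passwd="root", db="dangdang")

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            for title, link, comment in zip(item["title"], item["link"], item["comment"]):
                # %s placeholders let pymysql escape quotes in the values
                cur.execute("insert into goods(title,link,comment) values(%s,%s,%s)",
                            (title, link, comment))
        self.conn.commit()
        return item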