Scraping Lianjia second-hand housing listings with a Python crawler

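Problem 1

Lianjia also has anti-crawler measures and a robots.txt restriction. I ignore robots.txt (otherwise nothing can be crawled). Crawling too frequently gets you banned outright, and the ban is only lifted the next day. There are several ways to avoid a ban: 1. disable cookies; 2. set request headers; 3. increase the interval between requests; 4. use proxies. I only used the first three, which you can see in settings.py and middlewares.py. Because there are no good free proxies, I did not use method 4 in the actual crawl, but I left the relevant code in middlewares.py for reference. Note that the proxy IPs listed there do not work.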
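As a rough illustration of the first three methods, a settings.py fragment might look like the sketch below. The setting names are standard Scrapy settings, but the concrete values (header contents, user agent string, a 3-second delay) are assumptions chosen for illustration, not the values from my project.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Ignore robots.txt, as described above.
ROBOTSTXT_OBEY = False

# 1. Disable cookies so consecutive requests are harder to tie to one session.
COOKIES_ENABLED = False

# 2. Send browser-like headers with every request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# 3. Slow down: wait between requests and fetch one page at a time.
DOWNLOAD_DELAY = 3               # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual wait around DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

A longer delay lowers the risk of a ban at the cost of a slower crawl, so the right value has to be found by trial.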

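Problem 2

My code only scraped 3,000 second-hand listings, while Beijing actually has roughly 20,000 second-hand homes for sale. It is not that I did not want to crawl them all: Lianjia only shows 100 pages (3,000 listings), and I do not know how they are sorted. I tried crawling district by district to get more data, but then the crawler was banned even more easily, after only a few pages. For now, using proxies looks like the only way around this.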
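For reference, a proxy-rotating downloader middleware could be sketched roughly as follows. This is not the code from my middlewares.py: the class name RandomProxyMiddleware, the PROXY_LIST setting and the proxy addresses are all hypothetical. The only Scrapy conventions it relies on are the process_request hook, the from_crawler factory and the request.meta["proxy"] key.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that sends each request through a random proxy."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, not a built-in Scrapy one.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy routes a request through the proxy named in request.meta["proxy"].
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# Placeholder addresses from a documentation-only IP range; they do not work.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]
```

To use something like this, the class would be registered in the DOWNLOADER_MIDDLEWARES setting and the list of working proxies supplied through the assumed PROXY_LIST setting.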

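Problem 3

The crawl begins at the start page and continues through page 100. The start_urls that I commented out in the code covers every district of Beijing, so if the crawler is not banned it can in theory collect every second-hand listing in the city. The scraped fields are listed below.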

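'region': residential community

'url': link to the listing's detail page

'houseinfo': listing details, e.g. | 3室2廳 | 126.4平公尺 | 南北 | 精裝 | 有電梯 (3 bedrooms and 2 living rooms | 126.4 m² | north-south facing | fully renovated | with elevator)

'unitprice': price per square meter (yuan)

'totalprice': total price (10,000 yuan)

'attention': number of followers

'visited': number of viewings led by an agent

'publishday': how long the listing has been published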
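Since houseinfo arrives as a single pipe-separated string, it usually has to be split before analysis. The sketch below assumes the parts always come in the order shown in the sample above (layout, area, orientation, decoration, elevator); the function name and the key names are my own choice for illustration, and real listings may omit or reorder parts.

```python
import re


def parse_houseinfo(houseinfo):
    """Split a pipe-separated houseinfo string into named parts."""
    # Drop the empty piece produced by the leading "|".
    parts = [p.strip() for p in houseinfo.split("|") if p.strip()]
    # Assumed order of the parts; zip stops early if a listing has fewer.
    keys = ["layout", "area", "orientation", "decoration", "elevator"]
    info = dict(zip(keys, parts))
    # Pull the numeric area out of a string such as "126.4平公尺".
    match = re.search(r"\d+(?:\.\d+)?", info.get("area", ""))
    info["area_m2"] = float(match.group()) if match else None
    return info


sample = "| 3室2廳 | 126.4平公尺 | 南北 | 精裝 | 有電梯"
print(parse_houseinfo(sample))
```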

The core spider code is below; the full code is available on my GitHub.

# -*- coding: utf-8 -*-
import re

import scrapy


class ershoufangspider(scrapy.Spider):
    name = "ershoufang"
    # Start URLs for every district of Beijing
    # start_urls = ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]
    # In the actual crawl I only used the default start URL, which is less likely to get banned
    start_urls = [""]

    def parse(self, response):
        houses = response.xpath(".//ul[@class='selllistcontent']/li")
        for house in houses:
            attention = ''
            visited = ''
            publishday = ''
            try:
                # Select the follow-info text once and reuse it below.
                follow = house.xpath(".//div[@class='followinfo']/text()")
                numbers = follow.re(r"\d+")
                attention = numbers[0]
                visited = numbers[1]
                # The listing age is not always given in days, so convert it roughly.
                age = follow.extract()[0].split('/')[2]
                if u'月' in age:
                    # Months: approximate each month as 30 days.
                    publishday = str(int(numbers[2]) * 30)
                elif u'年' in age:
                    # Years: treated as 365 days regardless of the number.
                    publishday = '365'
                else:
                    publishday = numbers[2]
            except (IndexError, ValueError):
                self.logger.warning("could not parse follow info on %s", response.url)
            # Only the fields computed in this excerpt are filled in; the item's
            # other fields (region, url, houseinfo, unitprice, totalprice) are not shown.
            yield {
                'attention': attention,
                'visited': visited,
                'publishday': publishday,
            }

        # Request the next page while the first two numbers found in the pager differ.
        page = response.xpath("//div[@class='page-box house-lst-page-box'][@page-data]").re(r"\d+")
        p = re.compile(r'[^\d]+')
        if len(page) > 1 and page[0] != page[1]:
            # Keep the URL up to its first digit and append the next page number.
            next_page = p.match(response.url).group() + str(int(page[1]) + 1)
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
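The day conversion is easier to test when pulled out of the spider into a plain function. The sketch below follows the same rules as the spider (a month counts as 30 days, anything measured in years counts as 365), but the function name and the sample strings are assumptions about what the follow-info text looks like, not captured page content.

```python
import re


def publish_days(follow_text):
    """Return the listing age in days, or None if it cannot be parsed."""
    # Expected shape: "<followers> / <viewings> / <age>".
    fields = follow_text.split("/")
    if len(fields) < 3:
        return None
    age = fields[2]
    if "年" in age:
        # Anything measured in years is capped at 365 days.
        return 365
    match = re.search(r"\d+", age)
    if match is None:
        return None
    number = int(match.group())
    # Months are approximated as 30 days; otherwise the unit is already days.
    return number * 30 if "月" in age else number


# Hypothetical sample strings
print(publish_days("12人關注 / 共3次帶看 / 2個月以前發布"))
print(publish_days("12人關注 / 共3次帶看 / 7天以前發布"))
```

Returning an integer (or None) instead of a string also makes later averaging simpler.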

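Here are a few things I noticed in the data. 1. Judging by publishday, listings are staying on the market longer on average. 2. The average unit price was about 70,000 yuan per square meter last month and has dropped by roughly 3,000 to 5,000 yuan this month. 3. The cheapest unit price in Beijing is 16,000 yuan per square meter and the most expensive is 149,000 yuan per square meter (neither the cheapest nor the most expensive listing has sold). Together these suggest the housing market has cooled slightly. To repeat: these figures come from a sample of 3,000 listings, not from the full set of listings, so take them with a grain of salt.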
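Figures like these can be computed from the scraped items with a few lines of Python. The sketch below is only an illustration: the function name is hypothetical, it assumes unitprice is stored as a numeric string in yuan, and the rows are made-up sample data, not results from the crawl.

```python
from statistics import mean


def summarize_unit_prices(items):
    """Return count, mean, min and max of the 'unitprice' field, or None if empty."""
    # Skip items whose unit price is missing or empty.
    prices = [float(item["unitprice"]) for item in items if item.get("unitprice")]
    if not prices:
        return None
    return {
        "count": len(prices),
        "mean": mean(prices),
        "min": min(prices),
        "max": max(prices),
    }


# Made-up rows for illustration only
rows = [
    {"unitprice": "16000"},
    {"unitprice": "70000"},
    {"unitprice": "149000"},
    {"unitprice": ""},
]
print(summarize_unit_prices(rows))
```

Running the same summary on each day's crawl and comparing the means is one simple way to track the month-over-month change.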

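I crawled a few days of data last month, and I have decided to run the crawl once a day on a schedule from now on; data accumulated over a long period should make some interesting conclusions possible. I have put all of the scraped data online so that everyone can get it. Also, don't forget to visit my blog.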
