Python Crawler in Practice: Finding Which Companies Are Affiliated with a Company's Shareholders

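【Objective】

Work out which other companies a company's legal representative is affiliated with, which domain names those affiliated companies have registered, and on which cloud platform those domain names were registered.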

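【Approach】

1. In a side-by-side test without logging in, Qichacha (企查查) showed more information than Tianyancha (天眼查), so Qichacha is used to look up the legal representative's affiliated companies. This article focuses on that step.

2. Starting from a company name, get the legal representative's affiliated companies from Qichacha. Then look up every affiliated company on Chinaz (站長之家) to find its domain names and the cloud platform it uses. For this step, see:

4. From the legal representative's profile page, extract the names of all the companies listed there.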

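5. First, analyze the source of the first page (the search results) and find the string that identifies the company, "firm_78668b40a82cd573c904c8891786102d.html".

The regular expression is pat = 'addsearchindex.*?href="(.*?)" target="_blank" class="ma_h1"'

6. Next, analyze the source of the second page (the company details) and find the string that identifies the legal representative, "pl_p1f05aa67cc68fa068f97cbd330e225b.html".

The regular expression is pat = 'btn-touzi.*?href="(.*?)".*?他關聯'

7. Analyze the source of the legal representative's detail page and extract the company names. A pattern emerges: every company name is preceded by the identifier firm_, and the number of occurrences of that identifier exactly equals the number of affiliated companies shown on the detail page. That makes it usable as the search key.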

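The regular expression is pat = 'firm_.*?>(.*?)' (as printed here, the lazy group has nothing after it and so captures only empty strings; a closing delimiter after the group appears to have been lost).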
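The three patterns from steps 5 to 7 can be tried offline on small HTML fragments. The sketch below is only an illustration: the fragments are invented rather than copied from the site, whose real markup may differ, and the variant of the step 7 pattern that ends in "<" is an assumption about how the capture group was originally terminated.

```python
import re

# Hypothetical HTML fragments, invented for illustration only.
search_html = (
    '<a onclick="addsearchindex(\'x\');" '
    'href="/firm_78668b40a82cd573c904c8891786102d.html" '
    'target="_blank" class="ma_h1">Example Co.</a>'
)
person_html = (
    '<a class="btn-touzi" href="/pl_p1f05aa67cc68fa068f97cbd330e225b.html">'
    '他關聯3家企業</a>'
)
detail_html = (
    '<a href="/firm_aaa.html">Alpha Ltd</a>'
    '<a href="/firm_bbb.html">Beta Ltd</a>'
)

# Step 5: link to the company page, taken from the search results.
firm_link = re.compile(
    r'addsearchindex.*?href="(.*?)" target="_blank" class="ma_h1"', re.S
)
# Step 6: link to the legal representative's page.
person_link = re.compile(r'btn-touzi.*?href="(.*?)".*?他關聯', re.S)
# Step 7 as printed: a lazy group at the very end matches as little as
# possible, which is the empty string.
unterminated = re.compile(r'firm_.*?>(.*?)', re.S)
# Step 7 with an assumed terminator: capture up to the next '<'.
terminated = re.compile(r'firm_.*?>(.*?)<', re.S)

firm_urls = firm_link.findall(search_html)
person_urls = person_link.findall(person_html)
empty_names = unterminated.findall(detail_html)
company_names = terminated.findall(detail_html)

print(firm_urls)
print(person_urls)
print(empty_names)
print(company_names)
```

re.S lets "." match newlines as well, which matters on real pages where the anchor text and the href attribute can sit on different lines.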

【Crawler output】

【Crawler code】

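#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import urllib.parse
import urllib.request

# NOTE: the site URL prefixes, the company name and the User-Agent value are
# not shown in this listing. The empty strings and the User-Agent below are
# placeholders that must be filled in before the script can run.
chinacompany = ""  # company name to search for (placeholder)
headers = ("User-Agent", "Mozilla/5.0")  # placeholder browser User-Agent

# Human-readable URL: the company name is still raw UTF-8 Chinese text
testurl = "" + chinacompany
print("visit web:" + testurl)
# Percent-encode the Chinese part only; the whole URL must not be quoted
company = urllib.parse.quote(chinacompany)
testurl = "" + company
print("visit web:" + testurl)
# Disguise the crawler as a browser so that the site does not block it
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
# Page 1: search for the company name and get the link to the company page
searchret = urllib.request.urlopen(testurl).read().decode("utf-8", "ignore")
matchpat = 'addsearchindex.*?href="(.*?)" target="_blank" class="ma_h1"'
nexturls = re.compile(matchpat, re.S).findall(searchret)
nexturl = "" + str(nexturls[0])
# Page 2: company details, with the legal representative and affiliated companies
searchret = urllib.request.urlopen(nexturl).read().decode("utf-8", "ignore")
matchpat = 'btn-touzi.*?href="(.*?)".*?他關聯'
nexturls = re.compile(matchpat, re.S).findall(searchret)
bossnum = len(nexturls)
# Loop over every boss and list his affiliated companies
for idx in range(bossnum):
    # Page 3: the shareholder's page, which lists the affiliated companies
    nexturl = "" + str(nexturls[idx])
    searchret = urllib.request.urlopen(nexturl).read().decode("utf-8", "ignore")
    matchpat = 'class="cvlu">(.*?)的合作夥伴'
    bossname = re.compile(matchpat, re.S).findall(searchret)[0]
    print("**********=")
    print("Shareholder details: " + nexturl + "    Affiliated companies:")
    # The closing '<' is an assumption: as printed, the pattern ended at the
    # lazy group and would therefore capture only empty strings.
    matchpat = 'firm_.*?>(.*?)<'
    relatedcompany = re.compile(matchpat, re.S).findall(searchret)
    print(relatedcompany)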
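The two preparation steps in the script, percent-encoding only the Chinese part of the URL and sending a browser-like User-Agent, can be checked without touching the network. The following is a minimal sketch under stated assumptions: BASE_URL, company_name and the User-Agent string are hypothetical placeholders, and a per-request urllib.request.Request is used here as an alternative to the globally installed opener in the script.

```python
import urllib.parse
import urllib.request

# Hypothetical values, for illustration only.
BASE_URL = "https://example.com/search?key="
company_name = "示例公司"

# Quote only the Chinese part, so the URL structure stays intact.
encoded = urllib.parse.quote(company_name)
search_url = BASE_URL + encoded

# Quoting the whole URL also escapes ':', '?' and '=', which breaks it.
over_quoted = urllib.parse.quote(BASE_URL + company_name)

# Attach the User-Agent to a single request instead of installing a
# global opener. Nothing is sent until urlopen(request) is called.
request = urllib.request.Request(
    search_url, headers={"User-Agent": "Mozilla/5.0"}
)

print(search_url)
print(over_quoted)
print(request.get_header("User-agent"))
```

Request stores header names in capitalized form, which is why the lookup uses "User-agent". Passing the request to urllib.request.urlopen would perform the actual fetch.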
