Write a crawler to fetch the PWA93 theoretical values for every np-scattering partial wave from nn-online.org.
You need to install python3, python3-pip, selenium, and geckodriver first.
The Python crawler code is as follows:
# Wang Jianfeng, Dec 14, 2018
# python3
# install selenium first: pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re
from urllib.request import urlopen
# install geckodriver first
driver = webdriver.Firefox()  # driver is Firefox
# Partial waves up to J = 5; E1..E5 are the mixing parameters.
# The three J = 3 triplet entries appeared as '3P2'/'3F2' duplicates
# in the original post; '3D3', '3F3', '3G3' are restored here by
# analogy with the other J blocks.
phaselist = ['1S0', '3P0', '1P1',
             '3S1', '3P1', '3D1', 'E1',
             '1D2', '3P2', '3D2', '3F2', 'E2',
             '1F3', '3D3', '3F3', '3G3', 'E3',
             '1G4', '3F4', '3G4', '3H4', 'E4',
             '1H5', '3G5', '3H5', '3I5', 'E5']
url1 = ""  # base query URL ending in "...tmin="; the actual string was
           # stripped from the original post and must be filled in
url2 = "&tmax="
url3 = "&tint=0.01&ps="
nntype = "np_"
txt = ".txt"
tmin = 0.01
tmax = 10.00
for phase in phaselist:
    fw = open(nntype + phase + txt, "w", encoding="utf-8")
    fw.write("\b\b")  # literal marker kept as in the original post
    tmin = 0.01
    tmax = 10
    while tmax <= 300:
        # request the table one 10-MeV energy window (tmin..tmax) at a time
        url = url1 + str(round(tmin, 2)) + url2 + str(tmax) + url3 + phase
        driver.get(url)
        html = driver.page_source
        # extract the PWA93 block; the tail of this pattern (the
        # end-of-table delimiter after the lazy group) was lost in the
        # original post and must be restored for a useful match
        res = re.findall(r"PWA93\b\b\b(.+?)", html, flags=re.DOTALL)
        fw.write(res[0].strip())
        fw.write("\n")
        if tmax < 100:
            fw.write("\b")  # marker kept as in the original post
        tmin = tmin + 10
        tmax = tmax + 10
    fw.close()
driver.quit()  # close the browser once all phases are done
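For reference, this is how each request URL is assembled from the pieces above; a minimal sketch, where the base address is a placeholder assumption because the real query string was stripped from the original post:

# Sketch of one request URL; BASE is hypothetical and stands in for
# the nn-online.org query URL lost from the post.
BASE = "https://nn-online.org/...?tmin="  # placeholder, not the real endpoint
tmin, tmax, phase = 0.01, 10, '1S0'
url = BASE + str(round(tmin, 2)) + "&tmax=" + str(tmax) + "&tint=0.01&ps=" + phase
print(url)  # -> https://nn-online.org/...?tmin=0.01&tmax=10&tint=0.01&ps=1S0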
Because the data nn-online returns sometimes has a duplicated last line, a second Python script re-edits the files. The code is as follows:
# Same partial-wave list as above (with the J = 3 entries restored).
phaselist = ['1S0', '3P0', '1P1',
             '3S1', '3P1', '3D1', 'E1',
             '1D2', '3P2', '3D2', '3F2', 'E2',
             '1F3', '3D3', '3F3', '3G3', 'E3',
             '1G4', '3F4', '3G4', '3H4', 'E4',
             '1H5', '3G5', '3H5', '3I5', 'E5']
nntype = "np_"
txt = ".txt"
dat = ".dat"
path = "out/"
for phase in phaselist:
    ii = 0
    fr = open(nntype + phase + txt, "r", encoding="utf-8")
    fw = open(path + nntype + phase + dat, "w", encoding="utf-8")
    # read each row in three fixed-width pieces: the first 8 characters
    # (the energy column), the next 11 (the phase-shift value), and the
    # remainder of the line
    line1 = fr.readline(8)
    line2 = fr.readline(11)
    line3 = fr.readline()
    while line1:
        line11 = line1
        line22 = line2
        line1 = fr.readline(8)
        line2 = fr.readline(11)
        line3 = fr.readline()
        # write a row only when the next row's energy column differs,
        # so a duplicated last line collapses to a single row
        if line11 != line1:
            fw.write(line11 + line22 + '\b\n')  # '\b' kept as in the original
            ii = ii + 1
    fr.close()
    fw.close()
    print(phase + " complete. line = " + str(ii))
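The fixed-width splitting above depends on the exact column layout of the nn-online tables. If the duplicated rows are byte-identical repeats, a layout-independent alternative is possible; a minimal sketch, not the original author's method, assuming duplicates are always adjacent:

# Alternative sketch: drop consecutive byte-identical lines from one
# file. Assumes the duplicated row is an exact repeat of the line
# immediately before it.
def drop_adjacent_duplicates(src, dst):
    prev = None
    with open(src, "r", encoding="utf-8") as fr, \
         open(dst, "w", encoding="utf-8") as fw:
        for line in fr:
            if line != prev:
                fw.write(line)
            prev = line

drop_adjacent_duplicates("np_1S0.txt", "out/np_1S0.dat")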