Python Regular Expression Basics

# re is part of the Python standard library, so nothing needs to be installed (there is no "pip install re")
import re

When writing a web crawler, regular expressions can extract the useful information from a page's source code.

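Basics 1

The global matching function is called in the form re.compile(regular expression).findall(source string). It returns a list of every non-overlapping match.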

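Ordinary characters    match themselves literally
\n          matches a newline
\t          matches a tab
\w          matches a letter, digit, or underscore
\W          matches any character except a letter, digit, or underscore
\d          matches a decimal digit
\D          matches any character except a decimal digit
\s          matches a whitespace character
\S          matches any character except whitespace
[ab89x]     character set: matches any one of a, b, 8, 9, x
[^ab89x]    negated character set: matches any one character other than a, b, 8, 9, x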
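The classes in the table can be tried out on a short string. The following is a minimal sketch; the sample strings are made up for illustration, and re.findall is the module-level shortcut for re.compile(pattern).findall(string).

```python
import re

sample = "id_42\tOK 7\n"  # made-up test string

digits = re.findall(r"\d", sample)       # each decimal digit
non_digits = re.findall(r"\D+", sample)  # runs of characters that are not digits
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscores
non_words = re.findall(r"\W", sample)    # anything that is not a word character
spaces = re.findall(r"\s", sample)       # tab, space, newline
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace characters

in_set = re.findall(r"[ab89x]", "a1b2x9")       # only characters listed in the set
not_in_set = re.findall(r"[^ab89x]", "a1b2x9")  # everything except those characters

print(digits)      # ['4', '2', '7']
print(words)       # ['id_42', 'OK', '7']
print(in_set)      # ['a', 'b', 'x', '9']
print(not_in_set)  # ['1', '2']
```

Note that the uppercase forms (\W, \D, \S) are the exact complements of the lowercase ones.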

Example 1

import re
string = 'csdnet'
pat = 'dn'
res = re.compile(pat).findall(string)
print(res)  # prints ['dn']; res[0] is the string 'dn'

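Basics 2

Ordinary characters    match themselves literally
.        matches any single character except a newline
^        matches the start position
$        matches the end position
*        the preceding character occurs 0, 1, or many times
?        the preceding character occurs 0 or 1 times
+        the preceding character occurs 1 or many times
{n}      the preceding character occurs exactly n times
{n,}     the preceding character occurs at least n times
{n,m}    the preceding character occurs at least n and at most m times
()       pattern unit (group): put parentheses around whatever part of the match you want to extract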
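A short sketch of the quantifiers, anchors, and groups listed above. The input strings are invented for illustration; only standard re calls are used.

```python
import re

log = "tel: 010-1234, 021-56789, 0755-123"  # made-up input

# \d{3} is exactly three digits; \d{4,} is four or more digits
exact = re.compile(r"\d{3}-\d{4,}").findall(log)

# \d{3,4} allows three or four digits; with one group, findall returns
# only the text captured by the parentheses
captured = re.compile(r"\d{3,4}-(\d+)").findall(log)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_tel = re.compile(r"^tel").findall(log)
trailing_digits = re.compile(r"\d+$").findall(log)

# ? makes the preceding character optional
optional = re.compile(r"colou?r").findall("color colour")

print(exact)     # ['010-1234', '021-56789']
print(captured)  # ['1234', '56789', '123']
print(optional)  # ['color', 'colour']
```

When a pattern contains a group, findall returns the group's content rather than the whole match, which is what makes parentheses useful for extraction.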

Example 2

import re
string = 'editorcsdnnet'
pat = 'dn...'
res = re.compile(pat).findall(string)
print(res)  # prints ['dnnet']: 'dn' followed by any three characters

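Basics 3

Greedy mode: match as much as possible

Lazy mode: match as little as possible (precise matching)

Greedy mode is the default.

A quantifier followed by ? switches to lazy mode, for example *? and +?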

Example 3

import re
string = 'editorcsdnnet'
pat = 'd.*?t'
res = re.compile(pat).findall(string)
print(res)
# prints ['dit', 'dnnet']: lazy mode stops at the first 't' after each 'd'
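To make the difference visible, the same string can be matched with the greedy and the lazy form of the pattern side by side. This is a small sketch that reuses the string from Example 3.

```python
import re

string = 'editorcsdnnet'

# Greedy: .* runs to the end of the string, then backs up to the last 't'
greedy = re.compile('d.*t').findall(string)

# Lazy: .*? stops at the first 't' it can reach, so two shorter matches are found
lazy = re.compile('d.*?t').findall(string)

print(greedy)  # ['ditorcsdnnet']
print(lazy)    # ['dit', 'dnnet']
```

When scraping HTML, the lazy form is usually the one you want, because a greedy .* between two tags tends to swallow everything up to the last matching tag on the page.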

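Basics 4

Pattern modifiers (flags) change the matching result without changing the regular expression itself.

Ordinary characters    match themselves literally
re.S                   lets . match newlines too, so a match can span multiple lines
re.I (uppercase i)     ignores case when matching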

Example 4

import re
string = 'editorcsdnnet'
pat = 'rcs'
res = re.compile(pat, re.I).findall(string)  # re.I: ignore case
print(res)
# prints ['rcs']

import re
string = """edito
rcsdnne"""
pat = 'e.*e'
res = re.compile(pat, re.S).findall(string)  # re.S: . also matches the newline
print(res)
# prints ['edito\nrcsdnne']
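The effect of each flag is easier to see when the same pattern is run with and without it. The sketch below uses a made-up mixed-case, multi-line string; flags can be combined with the | operator.

```python
import re

text = "Editor\nCSDN\nnet"  # made-up sample with mixed case and newlines

# Without re.I the lowercase pattern does not match the uppercase text
case_sensitive = re.compile('csdn').findall(text)
ignore_case = re.compile('csdn', re.I).findall(text)

# Without re.S the dot stops at the newline, so the match stays on the first line
single_line = re.compile('E.*t').findall(text)
dot_all = re.compile('E.*t', re.S).findall(text)

# Both flags together: lowercase 'e' matches 'E' and .* crosses the newlines
combined = re.compile('e.*t', re.I | re.S).findall(text)

print(case_sensitive)  # []
print(ignore_case)     # ['CSDN']
print(single_line)     # ['Edit']
print(dot_all)         # ['Editor\nCSDN\nnet']
print(combined)        # ['Editor\nCSDN\nnet']
```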

One more example follows:

# Use regular expressions and requests to fetch job listings from 51job (前程無憂)
# The complete code follows
# _*_coding:utf-8_*_
# File: 爬取前程無憂.py
# IDE: PyCharm
import requests
import re

hd = {}  # request headers; the original value is missing from the source
city = input("Enter the city name in pinyin: ")  # must be lowercase pinyin; a limitation of the code
job = input("Enter the job title: ")
txt = requests.get("" + city)  # 51job home page for the given city (the base URL is missing from the source)
data = bytes(txt.text, txt.encoding).decode("gbk", "ignore")
pat_city_id = ''  # the pattern is missing from the source
city_id = re.compile(pat_city_id, re.S).findall(data)[0]  # ID of the given city
# print(city_id)
txt_1 = requests.get(
    "" + str(city_id) + ",000000,0000,00,9,99," + job +
    ",2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0",
    headers=hd)
data = bytes(txt_1.text, txt_1.encoding).decode("gbk", "ignore")
# print(data)
pat = ' .*?共(.*?)條職位.*?'  # the HTML tags in this pattern were stripped from the source
num = re.compile(pat, re.S).findall(data)[0]  # total number of jobs found by the search
# print(num)
page = int(num) // 50 + 1  # number of result pages, at 50 jobs per page
# print(page)
for i in range(0, page):
    print("----Crawling page " + str(i + 1) + "----")
    this_url = (
        "" + str(city_id) + ",000000,0000,00,9,99," + job + ",2," + str(i + 1) +
        ".html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0")
    txt_2 = requests.get(this_url, headers=hd)
    data = bytes(txt_2.text, txt_2.encoding).decode("gbk", "ignore")
    # print(data)
    # The HTML tags in the three patterns below were stripped from the source
    pat_title = ' .*?'
    pat_company = ''
    pat_money = '.*?(.*?).*?'
    try:
        for j in range(0, 50):
            title = re.compile(pat_title, re.S).findall(data)[j]  # job title
            company = re.compile(pat_company, re.S).findall(data)[j]  # company
            money = re.compile(pat_money, re.S).findall(data)[j]  # salary
            # print(len(money))
            # print(title)
            # print(company)
            # print(money)
            # print("--------")
            with open("工作.doc", "a+", encoding='utf-8') as f:
                f.write(title + "\r\n" + company + "\r\n" + money + "\r\n" + "------\r\n")
    except Exception as ess:
        # A page with fewer than 50 entries raises IndexError; it is silently ignored
        pass

This example is explained in detail in a separate blog post.
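The extraction step of the crawler can also be written so that each pattern is compiled once and all fields of a listing are captured together, which avoids indexing up to 50 and catching the resulting exception. The sketch below is not the original author's code: the HTML snippet, the class names, the patterns, and the helper names page_count and parse_jobs are all hypothetical, since the real page markup is not shown here. It runs offline on the sample string.

```python
import math
import re

# Hypothetical markup; the structure of the real page is an assumption.
SAMPLE_HTML = """
<div class="summary">共 120 條職位</div>
<div class="job"><a class="title">Python Developer</a><span class="company">Acme</span><span class="salary">1-2萬/月</span></div>
<div class="job"><a class="title">Data Engineer</a><span class="company">Globex</span><span class="salary">2-3萬/月</span></div>
"""

PAGE_SIZE = 50  # the same page size the crawler loop assumes

# Compile each pattern once instead of on every loop iteration.
PAT_TOTAL = re.compile(r"共\s*(\d+)\s*條職位")
PAT_JOB = re.compile(
    r'<a class="title">(.*?)</a>.*?'
    r'<span class="company">(.*?)</span>.*?'
    r'<span class="salary">(.*?)</span>',
    re.S,
)


def page_count(total, page_size=PAGE_SIZE):
    """Number of pages needed for `total` results (ceiling division)."""
    return math.ceil(total / page_size)


def parse_jobs(html):
    """Return a list of (title, company, salary) tuples found in `html`."""
    return PAT_JOB.findall(html)


total = int(PAT_TOTAL.search(SAMPLE_HTML).group(1))
jobs = parse_jobs(SAMPLE_HTML)

print(total, page_count(total))  # 120 3
for title, company, salary in jobs:
    print(title, company, salary)
```

Two design points: with three groups in one pattern, findall returns one tuple per listing, so title, company, and salary cannot drift out of step; and ceiling division gives 2 pages for exactly 100 results, where int(num) // 50 + 1 would request a third, empty page.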
