C# Regular Expressions for Scraping Web Page Content

Original article: C# Regular Expressions for Scraping Web Page Content

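An important step in a search engine is extracting the useful content from a web page. Put simply, this means removing the HTML tags from the HTML text and keeping the part we would see when opening the HTML document in a browser such as IE (some of it we do not consider here).
The markup in the HTML text is divided into comments, script, style, and other tags, and each kind is removed in turn: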
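1. Remove comments. The regex is:
// The pattern is missing from the source; <!--[\s\S]*?--> is a reconstruction.
output = Regex.Replace(input, @"<!--[\s\S]*?-->", string.Empty, RegexOptions.IgnoreCase);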
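2. Remove script. The regex is:
// Both patterns are missing from the source and are reconstructions; the <noscript> target of the second pass is an assumption.
output = Regex.Replace(input, @"<script[^>]*>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output2 = Regex.Replace(output, @"<noscript[^>]*>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);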
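3. Remove style. The regex is:
// The pattern is missing from the source; this one is a reconstruction.
output = Regex.Replace(input, @"<style[^>]*>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);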
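4. Remove the other HTML tags
// The entity names, the <br> tag and the \S in the final pattern were damaged in the source and are restored here.
// Tags are stripped before entities are decoded so that a decoded "<" or ">" is not removed as markup; "&amp;" is decoded last.
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp;", "&");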
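As a possible alternative to the hand-written entity replacements in step 4, the sketch below uses the built-in System.Net.WebUtility.HtmlDecode, which handles the full set of named and numeric entities. The class and method names (EntityDecodeSketch, TagsThenEntities) are hypothetical, and the sketch assumes that tags should be stripped before entities are decoded. Note that HtmlDecode turns `&nbsp;` into a non-breaking space (U+00A0), not an ordinary space.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecodeSketch
{
    // Hypothetical helper, not part of the original article.
    public static string TagsThenEntities(string html)
    {
        // Turn <br>, <br/> and <br /> into line breaks before the tags disappear.
        string text = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        // Strip the remaining tags; [\s\S] matches newlines without Singleline.
        text = Regex.Replace(text, @"<[\s\S]*?>", string.Empty);
        // Decode entities last, so a decoded "<" is never treated as a tag.
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        // Prints "1 < 2 && 3 > 2" and then "done" on the next line.
        Console.WriteLine(TagsThenEntities("<p>1 &lt; 2 &amp;&amp; 3 &gt; 2<br>done</p>"));
    }
}
```

If ordinary spaces are wanted, as in step 4, the result can be passed through `Replace('\u00A0', ' ')` afterwards.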
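In the code for steps 2 and 3 above, you can see that I used the RegexOptions.Singleline option. This option is important: it lets "." (the dot) match newline characters. Without it, the regular expressions listed above fail to remove the HTML markup in most cases, because script and style blocks usually span several lines.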
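The effect of RegexOptions.Singleline can be checked with a small experiment. The sketch below is only an illustration: the class SinglelineDemo, the helper StripScript and the sample HTML are hypothetical, and the pattern is an assumed one for script blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: removes <script> blocks using the given options.
    public static string StripScript(string html, RegexOptions options)
    {
        return Regex.Replace(html, @"<script[^>]*>.*?</script>", string.Empty, options);
    }

    public static void Main()
    {
        string html = "<p>a</p><script>\nvar x = 1;\n</script><p>b</p>";

        // Without Singleline, "." stops at "\n", so the multi-line block survives.
        string kept = StripScript(html, RegexOptions.IgnoreCase);
        // With Singleline, "." also matches "\n", so the whole block is removed.
        string removed = StripScript(html, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(kept == html); // True
        Console.WriteLine(removed);      // <p>a</p><p>b</p>
    }
}
```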
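HTML has grown quite complex over the years, and only the most important kinds of markup are covered above. I will add more tag-removing regular expressions as the development of ROST WebSpider continues.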
Below is a C# class that extracts the useful content from an HTML string:
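using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
class HtmlExtract
{
    // The field, constructor and method bodies are missing from the source; they are reconstructed from steps 1 to 4 above, and the patterns are assumptions.
    private readonly string _html;
    public HtmlExtract(string html) { _html = html; }
    #region public methods
    // "override" is dropped because no base class appears in the source.
    public string ExtractText() => RemoveTags(RemoveStyle(RemoveScript(RemoveComment(_html))));
    #endregion
    #region private methods
    private string RemoveComment(string input) => Regex.Replace(input, @"<!--[\s\S]*?-->", string.Empty);
    private string RemoveStyle(string input) => Regex.Replace(input, @"<style[^>]*>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private string RemoveScript(string input) => Regex.Replace(input, @"<script[^>]*>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private string RemoveTags(string input)
    {
        // Strip tags first, then decode the entities handled in step 4, with "&amp;" last.
        string result = Regex.Replace(input.Replace("<br>", "\r\n"), @"<[\s\S]*?>", string.Empty);
        return result.Replace("&nbsp;", " ").Replace("&quot;", "\"").Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&");
    }
    #endregion
}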
