Extracting text from a web page with Jsoup while preserving line breaks for the client

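This is yet another requirement that looks simple but takes a lot of fiddling and still never gives a perfect result.

1. Get the text directly, but lose the line breaks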

document.text()

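2. Parse p, h1 and div separately. This breaks if the page uses tags other than these three or nests them inside each other, and it can break again when the site is redesigned a few days later.

Or simply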

document.select("*")

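De-duplicating and un-nesting the results is even more trouble, and may not be feasible at all, because some text on the page really is repeated. So is there a way to extract the text that keeps the line breaks?

3. Extract the text while preserving line breaks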

// Note: newer jsoup releases rename Whitelist to Safelist.
Jsoup.clean(jsArticleDetail.toString(), "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));

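This keeps the line breaks, but the runs of whitespace carried over from the page markup are still not ideal, so the only option is to replace the extra whitespace with \n. This may also replace spaces that genuinely belong to the text, but it is the closest result to the requirement so far.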

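String baseContent = Jsoup.clean(jsArticleDetail.toString(), "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// Collapse every run of whitespace into a single newline
String newText = baseContent.replaceAll("\\s+", "\n");
// Remove the extra whitespace at the start and end; trim() alone is enough here,
// whereas replaceFirst("\n", "") could delete an interior newline when the text has no leading one
String trueContent = newText.trim();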
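
To see what this replacement does without fetching a page, here is a small sketch that uses only the Java standard library. The class name CollapseWhitespace, the method toLines, and the sample string are made up for illustration; the string only imitates what tag-stripped HTML might look like with pretty-printing turned off.

```java
class CollapseWhitespace {
    // Replace every run of whitespace with a single newline, then trim both ends.
    static String toLines(String raw) {
        return raw.replaceAll("\\s+", "\n").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample imitating tag-stripped HTML.
        String raw = "\n  Title \n\n   First paragraph\n \n Second one  ";
        System.out.println(toLines(raw));
    }
}
```

Note that "First paragraph" ends up on two lines, because the single space between the words is also a whitespace run. That is the cost described above: spaces that really belong to the text are replaced too.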

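4. Optimization

We can refine this a little: only replace a run with \n when it contains 2 or more whitespace characters. If the article itself really does contain 2 or more consecutive spaces, there is no better way to handle that. You can also change the 2 to another number to suit the pages you are parsing.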

String baseContent = Jsoup.clean(jsArticleDetail.toString(), "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// Only runs of 2 or more whitespace characters become a newline; adjust the 2 as needed
String newText = baseContent.replaceAll("\\s{2,}", "\n");
// Remove the extra whitespace at the start and end
String trueContent = newText.trim();
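
The threshold can also be passed in as a parameter. The sketch below is only an illustration: RunThreshold, breakOnRuns, minRun, and the sample text are assumed names and are not part of Jsoup. It turns runs of at least minRun whitespace characters into one newline and leaves shorter runs alone.

```java
class RunThreshold {
    // Replace runs of at least minRun whitespace characters with one newline,
    // then trim both ends. Shorter runs are left untouched.
    static String breakOnRuns(String raw, int minRun) {
        if (minRun < 1) {
            throw new IllegalArgumentException("minRun must be >= 1");
        }
        return raw.replaceAll("\\s{" + minRun + ",}", "\n").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample imitating tag-stripped HTML.
        String raw = "  Hello world \n\n  Second line   here ";
        System.out.println(breakOnRuns(raw, 2));
    }
}
```

With a threshold of 2, the single spaces inside "Hello world" and "Second line" survive, while the three spaces before "here" still become a line break. Raising the threshold keeps more of the original spacing at the price of fewer detected line breaks.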

Special thanks to lx for providing the regular expression, haha.
