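Two questions
b. How to extract the post time from these pages
Analysis:
The pages turn out to be very regular. The rule is roughly as follows:
Every post time looks like this: [2008-08-09 14:51:35]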
Rule: \[(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)\]
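As a minimal sketch of how the rule above could be applied in Java: the class name PostTimeExtractor, the method extractPostTimes, and the sample page string are all made up for illustration. The sketch assumes that no space follows the opening bracket, as in the sample timestamp, and it uses the equivalent \d{4} form instead of repeating \d.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostTimeExtractor {
    // A bracketed "yyyy-MM-dd HH:mm:ss" timestamp; the parentheses capture
    // the timestamp itself (without the brackets) as group 1.
    private static final Pattern POST_TIME =
            Pattern.compile("\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\]");

    // Collects every post time found in the page source, in order of appearance.
    static List<String> extractPostTimes(String html) {
        List<String> times = new ArrayList<>();
        Matcher m = POST_TIME.matcher(html);
        while (m.find()) {
            times.add(m.group(1));
        }
        return times;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, only for demonstration.
        String page = "<td>user1 [2008-08-09 14:51:35]</td>"
                + "<td>user2 [2008-08-10 09:03:12]</td>";
        for (String time : extractPostTimes(page)) {
            System.out.println(time);
        }
    }
}
```

Note that every backslash of the rule has to be doubled inside a Java string literal.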
Regular Expression Syntax

Syntax                    Explanation

Characters
c                         The character c
\unnnn, \xnn,             The code unit with the given hex or octal value
\0n, \0nn, \0nnn
\t, \n, \r, \f, \a, \e    The control characters tab, newline, return, form feed, alert, and escape
\cc                       The control character corresponding to the character c

Character Classes
[C1C2...]                 Any of the characters represented by C1, C2, ... The Ci are characters, character ranges (c1-c2), or character classes
[^...]                    Complement of character class
[...&&...]                Intersection of two character classes

Predefined Character Classes
.                         Any character except line terminators (or any character if the DOTALL flag is set)
\d                        A digit [0-9]
\D                        A nondigit [^0-9]
\s                        A whitespace character [ \t\n\r\f\x0B]
\S                        A non-whitespace character
\w                        A word character [a-zA-Z0-9_]
\W                        A nonword character
\p{name}                  A named character class, see Table 12-9
\P{name}                  The complement of a named character class

Boundary Matchers
^ $                       Beginning, end of input (or beginning, end of line in multiline mode)
\b                        A word boundary
\B                        A nonword boundary
\A                        Beginning of input
\z                        End of input
\Z                        End of input except final line terminator
\G                        End of previous match

Quantifiers
X?                        Optional X
X*                        X, 0 or more times
X+                        X, 1 or more times
X{n} X{n,} X{n,m}         X n times, at least n times, between n and m times

Quantifier Suffixes
?                         Turn default (greedy) match into reluctant match
+                         Turn default (greedy) match into possessive match

Set Operations
XY                        Any string from X, followed by any string from Y
X|Y                       Any string from X or Y

Grouping
(X)                       Capture the string matching X as a group
\n                        The match of the nth group

Escapes
\c                        The character c (must not be an alphabetic character)
\Q...\E                   Quote ... verbatim
(?...)                    Special construct, see API notes of Pattern class
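A few of the entries above are easier to grasp when run. The following is only an illustrative sketch: the class QuantifierDemo, the helper firstMatch, and the sample strings are invented here. It contrasts the greedy, reluctant, and possessive forms of the same quantifier, then shows a back-reference (\1) and a \Q...\E quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b><b>two</b>";

        // Greedy: .+ grabs as much as it can, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));   // whole string

        // Reluctant: .+? grabs as little as it can, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", input));  // <b>

        // Possessive: .++ grabs everything and never gives anything back,
        // so the trailing '>' can no longer match.
        System.out.println(firstMatch("<.++>", input));  // null

        // Back-reference: \1 must repeat whatever group 1 captured.
        System.out.println(firstMatch("\\b(\\w+) \\1\\b", "it is is fine")); // is is

        // \Q...\E: the brackets and the dot are taken literally.
        System.out.println(firstMatch("\\Q[a.b]\\E", "x[a.b]y")); // [a.b]
    }
}
```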
A regular expression for stripping the tags from HTML and extracting the body text:
||]*>||
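The expression above did not survive intact, so the following is not a reconstruction of it. It is a rough sketch of one common approach, with assumed patterns: first drop whole script and style blocks, then drop every remaining tag with <[^>]*>, then collapse the leftover whitespace. The names HtmlTextExtractor and stripTags are hypothetical. Regular expressions are not a full HTML parser, so this will mishandle unusual markup such as a '>' inside an attribute value.

```java
import java.util.regex.Pattern;

public class HtmlTextExtractor {
    // Assumed pattern: a <script> or <style> element including its content.
    // \1 makes the closing tag name match the opening one.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumed pattern: any single tag, from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes script/style blocks and tags, returning the remaining text.
    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical input, only for demonstration.
        String html = "<p>Hello world</p><script>var a = 1;</script><p>Bye</p>";
        System.out.println(stripTags(html));
    }
}
```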
I am also uploading a regular expression testing tool: