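Natural language processing, data cleaning: stripping HTML tags from text

A body of text often mixes real content with a lot of HTML tags, which makes it messy and in need of cleaning. Below is a Python script that filters out the junk HTML.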

# -*- coding:utf-8 -*-
import pandas as pd
import re
import jieba
def filter_tags(htmlstr):
    """
    Remove (filter) HTML tags from a string using regular expressions.
    :param htmlstr: HTML string
    :return: the cleaned text
    """
    # Filter CDATA first
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # match CDATA
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script
    s = re_style.sub('', s)  # remove style
    s = re_br.sub('\n', s)  # convert <br> to a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Collapse redundant blank lines
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = replacecharentity(s)  # replace character entities
    return s
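def replacecharentity(htmlstr):
    """
    :param htmlstr: HTML string
    :function: replace HTML character entities with plain characters
    """
    # Common named and numeric entities
    char_entities = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}
    re_charentity = re.compile(r'&#?(?P<name>\w+);')
    sz = re_charentity.search(htmlstr)
    while sz:
        entity = sz.group()  # the full entity, e.g. &gt;
        key = sz.group('name')  # the entity without & and ;, e.g. gt for &gt;
        try:
            htmlstr = re_charentity.sub(char_entities[key], htmlstr, count=1)
            sz = re_charentity.search(htmlstr)
        except KeyError:
            # Unknown entity: replace it with an empty string
            htmlstr = re_charentity.sub('', htmlstr, count=1)
            sz = re_charentity.search(htmlstr)
    return htmlstr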
def repalce(s, re_exp, repl_string):
    return re_exp.sub(repl_string, s)
def cleaning_data(x):
    # Remove runs of spaces and carriage-return, newline and tab characters
    m2 = str(x).replace('       ', '').replace('\r', '').replace('\n', '').replace('\t', '').strip()
    m3 = filter_tags(m2)
    m4 = replacecharentity(m3)
    print(m4)
if __name__ == '__main__':
    # Read the data
    data = pd.read_csv('c:\\users\\xiaohu\\desktop\\香蕉球使用者話題\\香蕉球使用者話題.csv')
    # print(data)
    for each in data.iloc[:, 3]:
        # print(each)
        cleaning_data(each)
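
The regex approach above is easy to read, but it can trip on input such as a script body that contains a `<` character. As a point of comparison, here is a minimal sketch of the same cleaning steps using only the standard library's `html.parser.HTMLParser`. This is a swapped-in technique, not the script above: the names `TextExtractor` and `strip_html` are hypothetical, and the behavior (skip script/style, turn `<br>` into a newline, decode entities, collapse blank lines) is assumed to mirror what `filter_tags` is meant to do.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &gt;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            # Treat <br> and <br/> as a line break, like re_br above
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(htmlstr):
    """Return the text content of htmlstr with tags and comments removed."""
    parser = TextExtractor()
    parser.feed(htmlstr)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Collapse redundant blank lines, as the regex version does
    return re.sub(r"\n+", "\n", text).strip()


if __name__ == "__main__":
    sample = "<p>a &gt; b<br/>next</p><script>if (a < b) {}</script><!-- note -->"
    print(strip_html(sample))
```

Comments are dropped because the parser's default `handle_comment` does nothing. Whether this is preferable to the regex version depends on how malformed the input is; the sketch is only meant to show the same pipeline expressed with a parser.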
