書蘊 (Shuyun) Notes 0: Text Preprocessing

The overall index for this series is here:

書蘊: an AI recommendation system based on book reviews

import os
import re

from openpyxl import load_workbook


def read_from_xlsx(path):
    """Write each book's reviews from one workbook to its own text file."""
    wb = load_workbook(path)
    ws = wb[wb.sheetnames[0]]
    rows = ws.max_row
    cols = ws.max_column
    # Start at row 2; column 1 of each row supplies the output file name.
    for row in range(2, rows + 1):
        with open("書評\\format\\" + ws.cell(row, 1).value + ".txt", 'w',
                  encoding='utf-8') as book_file:
            # book_file.write(ws.cell(row, 1).value + "\n")
            contents = []
            # Columns 7 onward are treated as review text.
            for col in range(7, cols + 1):
                content = ws.cell(row, col).value
                if str(content) == 'None':  # empty cells are read as None
                    continue
                content = re.sub("<[^>]+>", " ", str(content))  # HTML tags -> space
                content = re.sub("\n", " ", content)  # line breaks -> space
                # print(content)
                contents.append(content + "\n")  # one review per output line
            book_file.writelines(contents)


if __name__ == '__main__':
    xlsxbase = "書評\\xlsx\\"
    xlsxs = os.listdir(xlsxbase)
    for xlsx in xlsxs:
        read_from_xlsx(xlsxbase + xlsx)

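The script reads each cell, uses the regular expression "<[^>]+>" to replace every HTML tag with a space, then replaces every line break with a space. When the text is written out, a newline is appended to each review, so every review ends up on its own line.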
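To make the cleaning step concrete, here is a minimal sketch applied to a single string instead of a spreadsheet cell. The helper name clean_review and the sample text are made up for illustration; only the two regular-expression substitutions come from the script above.

```python
import re


def clean_review(raw):
    """Strip HTML tags and line breaks from one review (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", raw)  # every HTML tag becomes a space
    text = re.sub(r"\n", " ", text)      # every line break becomes a space
    return text + "\n"                   # terminate the review with one newline


# Made-up sample input, not taken from the real data set
sample = "<p>A moving story.</p>\n<br/>Highly recommended"
print(repr(clean_review(sample)))
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being glued together, at the cost of some repeated spaces that the later segmentation step can ignore.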

import os
import time

import jieba.posseg as pseg


def seg_book(book_base, book_name, outfile_path):
    """Segment one book's reviews, dropping punctuation and stop words."""
    with open(book_base + book_name, 'r', encoding='utf-8') as infile, \
            open(outfile_path + "seg_" + book_name, 'w', encoding='utf-8') as outfile:
        for line in infile:
            line = line.strip()
            # print(line)
            words = pseg.cut(line)
            for word, flag in words:
                # Flags starting with 'x' mark punctuation and other non-words.
                if flag.startswith('x'):
                    continue
                if word in stopwords_set:
                    continue
                outfile.write(word + ' ')
            outfile.write('\n')


if __name__ == '__main__':
    with open("util\\stopwords_csdn.txt", 'r', encoding='utf-8') as cn_stopwords_file:
        cn_stopwords_set = set(cn_stopwords_file.read().splitlines())
    with open("util\\stopwords_google.txt", 'r', encoding='utf-8') as en_stopwords_file:
        en_stopwords_set = set(en_stopwords_file.read().splitlines())
    # Build the union once; seg_book reads it as a module-level global.
    stopwords_set = cn_stopwords_set | en_stopwords_set
    start = time.time()
    infilebase = "書評\\format\\"
    books = os.listdir(infilebase)
    for book in books:
        print(book + " segmenting...")
        seg_book(infilebase, book, "書評\\seg\\")
    # seg_book(infilebase, "追風箏的人.txt", "書評\\seg\\")
    end = time.time()
    print("Total time: %d seconds" % (end - start))

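This step segments the raw data into words and removes stop words. Segmentation uses jieba, and punctuation is dropped along the way. The English stop words are Google's list; the Chinese stop words come from a list published on a CSDN blog.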
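The filtering logic can be tried without installing jieba. The sketch below assumes hand-written (word, flag) pairs standing in for the output of pseg.cut; the function name filter_tokens, the sample pairs, and the tiny stop-word set are all invented for illustration, and the part-of-speech flags jieba would really assign may differ.

```python
def filter_tokens(tagged_words, stopwords):
    """Keep words that are neither non-word tokens nor stop words."""
    kept = []
    for word, flag in tagged_words:
        if flag.startswith("x"):  # 'x' flags: punctuation and other non-words
            continue
        if word in stopwords:
            continue
        kept.append(word)
    return kept


# Hand-written pairs; real flags from pseg.cut() may differ (assumption)
tagged = [("這", "r"), ("本", "q"), ("書", "n"), ("，", "x"),
          ("很", "d"), ("感人", "a"), ("。", "x")]
stopwords = {"這", "本", "很"}  # toy stop-word set, not the real lists
print(" ".join(filter_tokens(tagged, stopwords)))
```

With the real lists loaded from disk, the same two checks are what seg_book applies to every token before writing it out.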

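Later I will consider using TF-IDF to remove stop words dynamically.

In fact, after training a word2vec model later on, I found many of the results unsatisfactory. For example, the character "中" was not removed. When it appears on its own it means roughly the English "in", so it belongs in the stop-word list.

So a stop-word list should not simply be taken as-is from what someone else compiled. The more reasonable approach I found on Zhihu is to remove the very common stop words first, then remove the rest using TF-IDF or manual screening.

Because we train one model per book and iterate on it to extract review tags, manually screening stop words for every book is not feasible, so a round of TF-IDF filtering will come later. Getting a quick prototype working is the top priority for now.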
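The notes do not say how the TF-IDF filtering would be done. As one possible starting point, the sketch below uses only the IDF half: words that occur in almost every review have an inverse document frequency near zero and are natural stop-word candidates. The function name stopword_candidates, the top_k parameter, and the toy reviews are assumptions for illustration, not the project's implementation.

```python
import math
from collections import Counter


def stopword_candidates(docs, top_k=3):
    """Return the top_k words with the lowest inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    # Sort by IDF, then by the word itself so ties come out in a stable order
    return sorted(idf, key=lambda w: (idf[w], w))[:top_k]


# Toy segmented reviews, invented for this example
docs = [
    ["中", "故事", "感人"],
    ["中", "人物", "真實"],
    ["中", "故事", "結局"],
]
print(stopword_candidates(docs, top_k=2))
```

Here "中" appears in every toy review, so its IDF is zero and it ranks first. On real data the cutoff would still need tuning, since a word that is common within one book's reviews may be exactly the tag that book should get.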
