Text Preprocessing Basics

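Preprocessing usually consists of four steps:

Read in the text

Tokenize it

Build a vocabulary that maps each token to a unique index

Convert the text from a sequence of tokens into a sequence of indices, so that it can be fed to a model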

We use an English novel, H. G. Wells's The Time Machine, as an example to walk through the concrete steps of text preprocessing.

import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # Lowercase each line and replace every run of non-letters with a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
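To see what the regular expression does without needing the data file, here is a minimal sketch that applies the same cleaning to a single line. The helper name `normalize_line` and the sample line are made up for illustration.

```python
import re

def normalize_line(line):
    """Lowercase a line and replace every run of non-letters with one space."""
    return re.sub('[^a-z]+', ' ', line.strip().lower())

# Hypothetical sample line, not taken from the actual file
sample = "The Time Machine, by H. G. Wells [1898]\n"
cleaned = normalize_line(sample)
print(repr(cleaned))  # 'the time machine by h g wells '
```

Digits and punctuation disappear, and a trailing run of non-letters leaves a trailing space behind, which matters once the line is split on single spaces.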

Next we tokenize each sentence, that is, we split a sentence into a number of tokens and turn it into a sequence of tokens.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
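The two tokenization modes can be compared on a few hand-written lines. This is a small sketch under assumed inputs; the `split()` variant at the end is an alternative shown for contrast, not what the code above does.

```python
# Hypothetical cleaned lines, including an empty one
lines = ['the time machine by h g wells ', '', 'i']

# Word-level tokens, as produced by splitting on a single space
word_tokens = [line.split(' ') for line in lines]
# Character-level tokens
char_tokens = [list(line) for line in lines]
# Alternative (an assumption, not the article's choice): split on any
# whitespace run, which avoids empty-string tokens
clean_tokens = [line.split() for line in lines]

print(word_tokens[0])   # ends with '' because of the trailing space
print(word_tokens[1])   # [''] for an empty line
print(clean_tokens[1])  # [] for an empty line
```

Splitting on `' '` keeps empty strings as tokens, so the empty string ends up in the vocabulary as an ordinary entry unless it is filtered out.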

To make the text easy for a model to process, we need to convert strings into numbers. We therefore first build a vocabulary that maps each token to a unique index.

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # token -> frequency
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        # Keep tokens that are frequent enough and not already present
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token returns one index; a list or tuple returns a list of indices
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # dict-like object recording how often each token occurs
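The effect of frequency counting and the `min_freq` threshold is easier to see on a toy corpus. The sketch below rebuilds just that part with plain lists and dicts; the corpus, the threshold of 2, and the `'<unk>'` placeholder name are assumptions chosen for illustration.

```python
import collections

# Hypothetical tokenized corpus
toy_tokens = [['the', 'time', 'machine'],
              ['the', 'time', 'traveller'],
              ['the', 'end']]

# Flatten the sentences and count each token
counter = collections.Counter(tk for st in toy_tokens for tk in st)

# Index 0 is reserved for unknown tokens; rarer tokens are dropped
min_freq = 2
idx_to_token = ['<unk>'] + [tk for tk, freq in counter.items() if freq >= min_freq]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}

print(counter['the'], counter['machine'])  # 3 1
print(token_to_idx)  # {'<unk>': 0, 'the': 1, 'time': 2}
```

Raising `min_freq` shrinks the vocabulary, and every dropped token is later mapped to the unknown index.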

Let's look at an example. Here we build a vocabulary using The Time Machine as the corpus.

vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])

With the vocabulary, we can convert the sentences of the original text from sequences of words into sequences of indices.

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
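The conversion is not fully reversible once a word falls outside the vocabulary. A minimal sketch of the round trip, using a hand-written mapping rather than the real corpus (the vocabulary contents and the `'<unk>'` name are assumptions):

```python
# Hypothetical tiny vocabulary; index 0 stands for unknown tokens
idx_to_token = ['<unk>', 'the', 'time', 'machine']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
unk = 0

def to_indices(tokens):
    """Map tokens to indices, falling back to the unknown index."""
    return [token_to_idx.get(token, unk) for token in tokens]

def to_tokens(indices):
    """Map indices back to tokens."""
    return [idx_to_token[i] for i in indices]

sentence = ['the', 'time', 'traveller']
indices = to_indices(sentence)
restored = to_tokens(indices)
print(indices)   # [1, 2, 0]
print(restored)  # ['the', 'time', '<unk>']
```

The word 'traveller' is not in this vocabulary, so it comes back as the unknown placeholder.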

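The tokenization method introduced above is very simple, and it has at least the following drawbacks:

Punctuation usually carries semantic information, but our method simply discards it

Words such as "shouldn't" and "doesn't" are handled incorrectly, because they are split at the apostrophe

Abbreviations such as "Mr." and "Dr." are handled incorrectly, because their period is stripped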
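These drawbacks are easy to reproduce by running the same cleaning and splitting steps on one sample sentence. This is a minimal sketch that repeats the regular expression and the single-space split used earlier.

```python
import re

text = "Mr. Chen doesn't agree with my suggestion."

# Same cleaning as in read_time_machine, then split on single spaces
cleaned = re.sub('[^a-z]+', ' ', text.strip().lower())
naive_tokens = cleaned.split(' ')
print(naive_tokens)
# ['mr', 'chen', 'doesn', 't', 'agree', 'with', 'my', 'suggestion', '']
```

The periods are gone, "doesn't" has become two tokens, and the final period leaves an empty token at the end.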

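We could address these problems by introducing more complex rules, but in practice there are existing tools that tokenize well. Here we briefly introduce two of them: spaCy and NLTK.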
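As a rough idea of what "more complex rules" might look like, the sketch below uses one regular expression plus a small abbreviation list. The pattern, the `ABBREVIATIONS` set, and the function name `rule_tokenize` are all illustrative assumptions, and the rules cover far fewer cases than a real tokenizer.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {'mr.', 'mrs.', 'dr.'}

# A word (optionally with an apostrophe part and a trailing period),
# or any single character that is neither whitespace nor a letter
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?\.?|[^\sA-Za-z]")

def rule_tokenize(text):
    tokens = []
    for match in TOKEN_PATTERN.findall(text):
        is_word_with_period = match.endswith('.') and len(match) > 1
        if is_word_with_period and match.lower() not in ABBREVIATIONS:
            # Not a known abbreviation: the period is separate punctuation
            tokens.extend([match[:-1], '.'])
        else:
            tokens.append(match)
    return tokens

print(rule_tokenize("Mr. Chen doesn't agree with my suggestion."))
# ['Mr.', 'Chen', "doesn't", 'agree', 'with', 'my', 'suggestion', '.']
```

Every new abbreviation or contraction style needs another rule, which is the main reason to prefer a maintained tokenizer.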

Here is a simple example using both tools:

text = "Mr. Chen doesn't agree with my suggestion."

spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])

NLTK:

from nltk.tokenize import word_tokenize
from nltk import data

# Tell NLTK where the tokenizer data is stored
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
print(word_tokenize(text))
