PyTorch Study Notes, Part 4: Text Preprocessing

We take an English novel, H. G. Wells's The Time Machine, and use it to walk through the concrete steps of text preprocessing.

import collections
import re

def read_time_machine():
    with open(r'd:\project\textpreprocess\timemachine7163\timemachine.txt', 'r') as f:
        # Strip leading and trailing whitespace, convert uppercase letters to lowercase,
        # and replace every run of non-letter characters with a single space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
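To see what the cleaning step does without needing the text file, here is a minimal sketch that applies the same regular expression to a few in-memory lines. The helper name `clean_line` and the sample strings are made up for illustration.

```python
import re

def clean_line(line):
    """Lowercase a line and collapse every run of non-letters into one space."""
    return re.sub('[^a-z]+', ' ', line.strip().lower())

# Hypothetical sample lines standing in for the contents of the text file
sample = [
    "The Time Machine, by H. G. Wells [1898]\n",
    "\n",
    "  'It's against reason,' said Filby.  \n",
]
cleaned = [clean_line(line) for line in sample]
for line in cleaned:
    print(repr(line))
```

Because `strip()` runs before the substitution, punctuation at the start or end of a line still turns into a leading or trailing space, and empty lines stay in the list as empty strings.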

Next we tokenize each sentence, that is, split a sentence into a number of tokens so that it becomes a sequence of words.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
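The sketch below runs the same two tokenization modes on a small hand-written sample, so the difference between word and character tokens is visible. Raising `ValueError` for an unknown mode and the `split()` variant at the end are my own variations, not part of the original code.

```python
def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens."""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    # Variation for this sketch: fail loudly instead of printing a message
    raise ValueError('unknown token type: ' + token)

# Hypothetical cleaned lines, including an empty one
sample = ['the time machine by h g wells ', '']

word_tokens = tokenize(sample, 'word')
char_tokens = tokenize(sample[:1], 'char')
print(word_tokens)
print(char_tokens[0][:3])

# split() with no argument drops the empty strings that split(' ') keeps
no_empty = [sentence.split() for sentence in sample]
print(no_empty)
```

Note that `split(' ')` leaves an empty-string token wherever a line ends with a space or is empty, so these empty tokens will also be counted when the vocabulary is built.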

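To make the text easier for a model to process, the strings need to be converted to numbers. We therefore first build a vocabulary that maps each token to a unique index.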

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # maps each token to its frequency
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to one index; unknown tokens map to self.unk
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    # Returns a dictionary recording how many times each token occurs
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)

vocab = Vocab(tokens)
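The core of the vocabulary is just counting tokens and numbering the ones that occur often enough. Here is a minimal sketch of that idea on a toy corpus. The corpus, the `min_freq = 2` threshold, and the `'<unk>'` placeholder string are assumptions chosen for illustration.

```python
import collections

# Hypothetical tokenized corpus: a list of sentences, each a list of tokens
corpus = [
    ['the', 'time', 'machine'],
    ['the', 'time', 'traveller'],
    ['the', 'machine'],
]

# Count how often each token occurs across all sentences
counter = collections.Counter(tk for st in corpus for tk in st)

# Index 0 is reserved for unknown tokens; rarer tokens are left out
min_freq = 2  # illustrative threshold
idx_to_token = ['<unk>'] + [tk for tk, freq in counter.items() if freq >= min_freq]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}

print(counter)
print(token_to_idx)
```

With this threshold, `'traveller'` occurs only once and therefore gets no index of its own, so any lookup for it has to fall back to index 0.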

With the vocabulary, we can convert sentences of the original text from sequences of words into sequences of indices.

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
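The conversion can also be reversed, but it is lossy for tokens outside the vocabulary. The sketch below encodes and decodes one sentence with a small hand-written mapping. The mapping and the `'<unk>'` placeholder are assumed for illustration rather than taken from the real vocabulary.

```python
# Hypothetical vocabulary: index 0 stands for unknown tokens
idx_to_token = ['<unk>', 'the', 'time', 'machine']
token_to_idx = {tk: i for i, tk in enumerate(idx_to_token)}

sentence = ['the', 'time', 'traveller']

# Encode: tokens not in the vocabulary fall back to index 0
indices = [token_to_idx.get(tk, 0) for tk in sentence]

# Decode: the unknown token cannot be recovered
decoded = [idx_to_token[i] for i in indices]

print(indices)
print(decoded)
```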

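The tokenization approach introduced above is very simple, and it has at least the following drawbacks:

1. Punctuation often carries semantic information, but our method simply discards it.

2. Words such as "shouldn't" and "doesn't" are handled incorrectly.

3. Words such as "Mr." and "Dr." are handled incorrectly.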
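These drawbacks are easy to reproduce. The following sketch applies the same lowercase, regex and `split(' ')` pipeline to a single made-up sentence. The sentence itself is only an illustration.

```python
import re

# Hypothetical sentence containing an abbreviation and a contraction
text = "Dr. Smith shouldn't leave."

# Same steps as the simple approach: lowercase, replace non-letters, split on spaces
cleaned = re.sub('[^a-z]+', ' ', text.strip().lower())
simple_tokens = cleaned.split(' ')

print(simple_tokens)
```

The period after "Dr" and the final full stop are gone, "shouldn't" has been broken into `'shouldn'` and `'t'`, and the trailing space produces an empty token.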

Some existing tools tokenize text well, for example spaCy and NLTK.

text = "Mr. Chen doesn't agree with my suggestion."

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])

from nltk.tokenize import word_tokenize
from nltk import data
# Tell NLTK where the downloaded tokenizer data lives
data.path.append('/home/input/nltk_data3784/nltk_data')
print(word_tokenize(text))
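For comparison, here is a rough standard-library sketch that keeps punctuation as separate tokens and keeps contractions in one piece. It is only a hand-rolled regular expression of my own, not how spaCy or NLTK work internally, and the pattern is an assumption chosen for this one sentence.

```python
import re

text = "Mr. Chen doesn't agree with my suggestion."

# A word, optionally followed by an apostrophe part, or any single
# non-space, non-letter character (punctuation)
pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]"
regex_tokens = re.findall(pattern, text)

print(regex_tokens)
```

Punctuation and "doesn't" survive here, but "Mr." is still split into `'Mr'` and `'.'`, which is one reason a dedicated tokenizer is usually the better choice.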
