Python Natural Language Processing, Chapter 3

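1. Accessing files:

a. Local files

import os

file = open(path) ---- returns a file object (a handle to the file, not its contents)

file.read() ---- returns the entire contents as a single string

for line in file: ---- iterates over the file one line at a time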
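As a minimal sketch of the calls above, the snippet below writes a small throwaway file and then reads it back both ways. The file name sample.txt and its contents are made up for illustration, and a with block is used so the file is closed automatically.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("first line\nsecond line\n")

# read() returns the whole file as one string.
with open(path, encoding="utf-8") as f:
    whole_text = f.read()

# Iterating over the file object yields one line at a time,
# each still ending in its newline character.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(repr(whole_text))
print(lines)
```

Reading line by line is usually the better choice for large files, since read() loads everything into memory at once.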

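b. Files on the web

from urllib.request import urlopen ---- Python 3; in Python 2 this was "from urllib import urlopen"

file = urlopen(url)

file.read() ---- in Python 3 this returns bytes; call .decode() with the right encoding to get a string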
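A small sketch of the same steps in Python 3 follows. The helper name fetch_text and the default encoding of utf-8 are assumptions made for the example; a real page may declare a different encoding.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and decode the raw bytes into a string.

    The encoding default is an assumption; pass the page's real
    encoding if it differs.
    """
    with urlopen(url) as response:
        raw_bytes = response.read()  # bytes in Python 3
    return raw_bytes.decode(encoding)
```

Calling fetch_text with the address of a web page returns its source as a string, which can then be passed on to a tokenizer.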

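2. Tokenization:

tokens = nltk.word_tokenize(string) ---- splits the string into tokens and returns a list

** word_tokenize relies on whitespace and punctuation to find token boundaries, so the string argument must contain them; a run of text with no such delimiters is generally returned as one token

type(tokens) ---- list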
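To illustrate why delimiters matter, here is a rough stand-in written with the standard library only. It is not NLTK's algorithm; the function name rough_tokenize and its pattern are assumptions chosen to show the idea that token boundaries come from whitespace and punctuation.

```python
import re


def rough_tokenize(text):
    """Crude stand-in for a tokenizer: runs of word characters,
    or single punctuation marks. Not NLTK's actual algorithm."""
    return re.findall(r"\w+|[^\w\s]", text)


with_delimiters = rough_tokenize("Hello, world! NLP is fun.")
without_delimiters = rough_tokenize("nodelimitershere")

print(with_delimiters)     # ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
print(without_delimiters)  # ['nodelimitershere']
print(type(with_delimiters))
```

The second call shows the limitation: with nothing to split on, the whole string stays one token.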

3. Processing files with Unicode

Module: codecs

import codecs

file = codecs.open(path, encoding="latin2")

for line in file:

    line = line.encode("unicode_escape") ---- the unicode_escape codec leaves printable ASCII characters alone, writes code points 128 to 255 as "\xXX", and writes higher code points as "\uXXXX"
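The following sketch shows the two escape forms side by side. The file polish.txt and the word in it are made up for the example; the word was picked because it mixes one character below code point 256 with several above it.

```python
import codecs
import os
import tempfile

# Hypothetical Latin-2 file, created here so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "polish.txt")
with codecs.open(path, "w", encoding="latin2") as out:
    out.write("zażółć\n")

escaped_lines = []
with codecs.open(path, encoding="latin2") as f:
    for line in f:
        # strip() removes the newline so it is not escaped as well
        escaped_lines.append(line.strip().encode("unicode_escape"))

print(escaped_lines)  # [b'za\\u017c\\xf3\\u0142\\u0107']
```

Here ó (code point 243) appears as \xf3, while ż, ł and ć, which lie above 255, appear in the \uXXXX form.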

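Character encodings:

utf-8 ---- 1 to 4 bytes per character

utf-16 ---- 2 bytes for most characters, 4 bytes for characters outside the Basic Multilingual Plane

utf-32 ---- always 4 bytes

gbk ---- 1 byte for ASCII characters, 2 bytes for Chinese characters

latin (latin-1, latin-2, ...) ---- 1 byte per character

Character sets: Unicode, ASCII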
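Byte counts per character are easy to check directly. In this sketch the sample characters are arbitrary choices, and the little-endian codec names (utf-16-le, utf-32-le) are used so that no byte-order mark inflates the count.

```python
samples = {"ascii": "a", "chinese": "中", "emoji": "😀"}
# The -le variants are used so no byte-order mark is added to the count.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {}
for label, ch in samples.items():
    sizes[label] = {enc: len(ch.encode(enc)) for enc in encodings}
    print(label, sizes[label])

# GBK and Latin-1 cannot represent every character, so they are
# measured separately on characters they do support.
gbk_sizes = {ch: len(ch.encode("gbk")) for ch in ("a", "中")}
latin1_size = len("é".encode("latin-1"))
print(gbk_sizes, latin1_size)
```

The output shows an ASCII letter taking 1 byte in GBK and a Chinese character taking 2, and the emoji taking 4 bytes in all three UTF encodings.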

4. Using regular expressions

Module: re

import re

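Metacharacters:

^: anchors the match at the start, e.g. r"^ad"

$: anchors the match at the end

.: any single character

?: the preceding character is optional

[abc]: any one of a, b, or c

*: zero or more repetitions

+: one or more repetitions

{n}: exactly n repetitions

{n,}: at least n repetitions

{,n}: at most n repetitions

r"abc": a raw string; Python does not interpret backslash escapes inside it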
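The behaviour of each metacharacter can be checked with a short table of patterns. The test strings below are invented for illustration; each row records a pattern, a string, and whether a match is expected.

```python
import re

checks = [
    (r"^ad", "adverb", True),       # ^ anchors at the start
    (r"^ad", "bad", False),
    (r"ed$", "walked", True),       # $ anchors at the end
    (r"^c.t$", "cat", True),        # . matches any single character
    (r"^colou?r$", "color", True),  # ? makes the u optional
    (r"^[abc]$", "b", True),        # [abc] matches one of a, b, c
    (r"^ab*$", "a", True),          # * allows zero repetitions
    (r"^ab+$", "a", False),         # + needs at least one
    (r"^a{2}$", "aa", True),        # {n}: exactly n
    (r"^a{2,}$", "aaaa", True),     # {n,}: at least n
    (r"^a{,2}$", "aaa", False),     # {,n}: at most n
]

results = []
for pattern, string, expected in checks:
    matched = bool(re.search(pattern, string))
    results.append(matched == expected)
    print(pattern, string, matched)

# A raw string keeps the backslash: two characters instead of one newline.
raw_len, cooked_len = len(r"\n"), len("\n")
print(raw_len, cooked_len)  # 2 1
```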

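Functions: (1) re.search("****", word) ---- tests whether word contains a match for the pattern "****" anywhere; returns a match object, or None if there is no match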

[w for w in wordlist if re.search(r"[a-z]+",w)]
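A short sketch of this filtering idiom follows, using a made-up word list. It also shows the effect of anchors: without ^ and $ the pattern only has to match somewhere inside the word.

```python
import re

# Made-up word list for illustration.
wordlist = ["abjectly", "Python", "1984", "walked", "NLP", "ad-hoc"]

# Unanchored: any word containing at least one lowercase letter.
has_lowercase = [w for w in wordlist if re.search(r"[a-z]+", w)]

# Anchored: words made up of lowercase letters only.
lowercase_only = [w for w in wordlist if re.search(r"^[a-z]+$", w)]

# Words ending in "ed".
ends_in_ed = [w for w in wordlist if re.search(r"ed$", w)]

print(has_lowercase)   # ['abjectly', 'Python', 'walked', 'ad-hoc']
print(lowercase_only)  # ['abjectly', 'walked']
print(ends_in_ed)      # ['walked']
```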

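(2) re.findall("****", word) ---- returns a list of all non-overlapping matches of the pattern "****" in word

cv_word_pairs = [(cv, w) for w in text
                 for cv in re.findall(r"[ptksvr][aeiou]", w)]

cv_index = nltk.Index(cv_word_pairs) ---- builds an index from each matched consonant-vowel pair to the words that contain it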
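The same consonant-vowel index can be sketched with the standard library alone. Here a defaultdict stands in for the NLTK index, which is an assumption made to keep the example dependency-free, and the four words are invented stand-ins for a real corpus.

```python
import re
from collections import defaultdict

# Made-up words standing in for a real corpus.
text = ["kasuari", "kaveto", "pita", "sivu"]

cv_word_pairs = [(cv, w) for w in text
                 for cv in re.findall(r"[ptksvr][aeiou]", w)]

# Stand-in for the NLTK index: map each pair to the words containing it.
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)

print(re.findall(r"[ptksvr][aeiou]", "kasuari"))  # ['ka', 'su', 'ri']
print(cv_index["ka"])                             # ['kasuari', 'kaveto']
```

Looking up a pair such as "ka" then returns every word in which that pair occurs.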

Application a. Normalizing text:

Stemmers ---- NLTK provides the Porter and Lancaster stemmers (nltk.PorterStemmer, nltk.LancasterStemmer)
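Porter and Lancaster are rule-based stemmers with many rules. As a much cruder illustration of the idea, the sketch below strips a single common suffix with one regular expression. It is not either of those algorithms, and the function name crude_stem and its suffix list are assumptions chosen for the example.

```python
import re


def crude_stem(word):
    """Strip at most one suffix from the end of word.

    The non-greedy (.*?) keeps the stem as short as possible,
    so the longest applicable suffix is removed.
    """
    pattern = r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$"
    stem, _suffix = re.findall(pattern, word)[0]
    return stem


for word in ["processing", "processes", "walked", "language"]:
    print(word, "->", crude_stem(word))
```

This maps "processing" and "processes" to the same stem "process", but it will also mangle short words such as "is", which is why real stemmers carry extra conditions.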

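b. Tokenizing text

re.split(pattern, sentence) ---- splits a sentence into tokens wherever the regular expression matches, e.g. re.split(r"\s+", sentence)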
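Two simple splitting patterns are compared below on a made-up sentence. Both patterns are illustrative choices rather than a recommended tokenizer.

```python
import re

sentence = "Hello, world!  Regex\tsplitting works."

# Split on runs of whitespace; punctuation stays attached to words.
by_whitespace = re.split(r"\s+", sentence)

# Split on runs of non-word characters; punctuation is dropped, and
# an empty string can appear at the edges, so those are filtered out.
by_non_word = [t for t in re.split(r"\W+", sentence) if t]

print(by_whitespace)  # ['Hello,', 'world!', 'Regex', 'splitting', 'works.']
print(by_non_word)    # ['Hello', 'world', 'Regex', 'splitting', 'works']
```

Splitting on whitespace keeps "Hello," as one token, while splitting on non-word characters discards the punctuation entirely; neither keeps punctuation as separate tokens.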
