Things to Watch Out for with the Keras Tokenizer

2021-08-31 00:20:24

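Steps:

1. Instantiate a Tokenizer object and pass it the maximum vocabulary size, nb_words (renamed num_words in Keras 2).

2. Tokenize all the articles with the tokenizer. Wrap the articles in a list first, with the words or characters of each article separated by spaces.

3. tokenizer.word_index gives every word together with its index, that is, the vocabulary. Note that words containing uppercase letters are converted to lowercase (the Tokenizer default is lower=True), so if the keys of the pretrained embedding are uppercase, convert the words back to uppercase when initializing the embedding later.

4. When initializing the embedding matrix, fill its rows in the order of the vocabulary indices, using a loop. Fetch each vector with the dictionary's get method: indexing with [] raises a KeyError when a word is not among the dictionary's keys. Wrapping the lookup in try/except works too.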
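Steps 3 and 4 can be sketched without Keras. The snippet below is only an illustration: the small word_index dictionary stands in for tokenizer.word_index, and the contents of all_embedding_dict, the dimension 3, and voc_size = 3 are made-up values, not taken from the original data.

```python
# Stand-in for tokenizer.word_index: the Tokenizer lowercases text by default,
# so the keys are lowercase, and the indices start at 1.
word_index = {"w107878": 1, "w2": 2, "unseen": 3}

# Hypothetical pretrained vectors whose keys are uppercase.
all_embedding_dict = {"W107878": [0.1, 0.2, 0.3], "W2": [0.4, 0.5, 0.6]}

embedding_dim = 3  # assumed dimension for this toy example
voc_size = 3       # assumed vocabulary limit

# Row 0 stays all zeros because index 0 is reserved for padding.
embed_matrix = [[0.0] * embedding_dim for _ in range(voc_size + 1)]

for word, idx in word_index.items():
    if idx > voc_size:
        continue  # skip indices that fall outside the matrix
    # get() returns None for a missing key instead of raising KeyError
    vector = all_embedding_dict.get(word.upper())
    if vector is not None:
        embed_matrix[idx] = list(vector)
```

Words without a pretrained vector, such as "unseen" here, simply keep their zero row.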

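import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# voc_size, max_sequence_len, all_embedding_dict and get_input_and_label_list
# are defined elsewhere in the author's project.
def get_train_test_data_embeddingweights():
    input_1_list, input_2_list, label_list = get_input_and_label_list()
    qlist = []
    qcontentlist = []
    # nb_words is the Keras 1 name; Keras 2 calls it num_words
    tokenizer = Tokenizer(nb_words=voc_size)
    token_dict = {}
    # question_content_matrix={}
    with open('question_id.csv', 'r', encoding='utf-8') as f:
        content_list = f.readlines()
    for i in content_list:
        values = i.split(',')
        # print(values)
        qid = values[0]
        # Assumption: the lines that filled qlist and qcontentlist are missing
        # from the post; column 1 is taken to hold the space-separated text.
        qlist.append(qid)
        qcontentlist.append(values[1])
    tokenizer.fit_on_texts(qcontentlist)
    sequences = tokenizer.texts_to_sequences(qcontentlist)
    # token_dict_for_emb = tokenizer.word_index.items()
    print(tokenizer.word_index)
    # print(tokenizer.word_index)
    embedding_matrix = all_embedding_dict
    embed_train_matrix = np.zeros((voc_size + 1, 300))
    # print(embedding_matrix)
    # print(embedding_matrix['w107878'])
    for w, i in tokenizer.word_index.items():
        # word_index holds every word seen, not only the first voc_size,
        # so skip indices that would fall outside the matrix
        if i > voc_size:
            continue
        # print(str(w))
        # print(embedding_matrix[str(w)])
        # The tokenizer lowercased the words; the embedding keys are uppercase
        embedding_vector = embedding_matrix.get(w.upper())
        if embedding_vector is not None:
            embed_train_matrix[i] = embedding_vector
        # print(embedding_matrix.get(w))
    data = pad_sequences(sequences, maxlen=max_sequence_len)
    # Map each question id (first CSV column) to its padded sequence
    for j in range(len(content_list)):
        token_dict[content_list[j].split(',')[0]] = data[j]
    x1_train_list = []
    x2_train_list = []
    y_list = []
    for i1 in range(len(input_1_list)):
        # Assumption: the loop body is missing from the post; looking up each
        # question id in token_dict is the likely intent.
        x1_train_list.append(token_dict[input_1_list[i1]])
        x2_train_list.append(token_dict[input_2_list[i1]])
        y_list.append(label_list[i1])
    x1_train_list = np.array(x1_train_list)
    x2_train_list = np.array(x2_train_list)
    y_list = np.array(y_list)
    # # Manual shuffle
    # indices = np.arange(len(x1_train_list))
    # np.random.shuffle(indices)
    # print(indices)
    # x1_train_list = x1_train_list[indices]
    # x2_train_list = x2_train_list[indices]
    # y_list = y_list[indices]
    # val_split=0.8
    #
    # x_1_train = x1_train_list[:int(val_split*x1_train_list.shape[0])]
    # x_1_test = x1_train_list[int(val_split*x1_train_list.shape[0]):]
    # x_2_train = x2_train_list[:int(val_split*x2_train_list.shape[0])]
    # x_2_test = x2_train_list[int(val_split*x2_train_list.shape[0]):]
    # y_train = y_list[:int(val_split*y_list.shape[0])]
    # y_test = y_list[int(val_split*y_list.shape[0]):]
    return x1_train_list, x2_train_list, y_list, embed_train_matrix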
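The commented-out lines at the end of the function shuffle the three arrays with one shared permutation and then cut them at val_split. A minimal pure-Python sketch of that idea follows; the helper name shuffle_and_split, the seed parameter, and the toy data are hypothetical and not part of the original code.

```python
import random

def shuffle_and_split(x1, x2, y, val_split=0.8, seed=0):
    """Shuffle three parallel lists with one permutation, then split them."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)  # a single permutation keeps rows aligned
    x1 = [x1[i] for i in indices]
    x2 = [x2[i] for i in indices]
    y = [y[i] for i in indices]
    cut = int(val_split * len(indices))   # first part is train, the rest is test
    return (x1[:cut], x2[:cut], y[:cut]), (x1[cut:], x2[cut:], y[cut:])

# Toy data: each x2 row is ten times the first element of its x1 row
x1 = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
x2 = [[10], [30], [50], [70], [90]]
y = [0, 1, 0, 1, 0]
train, test = shuffle_and_split(x1, x2, y)
```

Indexing all three lists with the same permutation is what keeps each question pair attached to its label after the shuffle.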

Some excerpts from the official documentation are pasted here:
