Keras Text Topic Classification: A Summary
1. Task Overview
Predict the category of a given input text. Due to limited time, experiments were run only on the 20 Newsgroups dataset.
Below is a brief description of the 20 Newsgroups dataset:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:
- comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
- rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
- sci.crypt, sci.electronics, sci.med, sci.space
- misc.forsale
- talk.politics.misc, talk.politics.guns, talk.politics.mideast
- talk.religion.misc, alt.atheism, soc.religion.christian
2. Data Processing
2.1 Preprocessing
The usual processing steps are: remove punctuation, remove stopwords, and normalize letter case. Here I use the already-preprocessed text provided at http://web.ist.utl.pt/~acardoso/datasets/ (a minimal sketch of the listed transformations follows the list below):
- 1. all-terms Obtained from the original datasets by applying the following transformations:
- 1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
- 2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
- 3. Turn all letters to lowercase.
- 4. Substitute multiple SPACES by a single SPACE.
- 5. The title/subject of each document is simply added in the beginning of the document's text.
- 2. no-short Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
- 3. no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
- 4. stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
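For reference, here is a minimal sketch of the "all-terms" transformations (items 1-4) described above. This is my own reconstruction, not the script used by that site, and the function name to_all_terms is made up for illustration; title prepending and stemming are not covered.
import re

def to_all_terms(text):
    # 1. substitute TAB, NEWLINE and RETURN characters by SPACE
    text = text.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
    # 2. keep only letters (punctuation, numbers, etc. become SPACES)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # 3. turn all letters to lowercase
    text = text.lower()
    # 4. substitute multiple SPACES by a single SPACE
    return re.sub(r' +', ' ', text).strip()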
The category labels are then converted into integer labels.
2.2 Converting Text to Features
Two different conversion methods are used, one for each model (DNN and LSTM):
a) Converting words to word indices
Build the word list from the training corpus, sort it by word frequency in descending order, and number the words starting from 0; every word in a sentence is then replaced by its index.
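A minimal sketch of this step, assuming each document is a whitespace-separated string (the helper names build_word_index and texts_to_indices are mine, not from the original code):
from collections import Counter

def build_word_index(texts, nb_words=1000):
    # count word frequencies over the training corpus
    counts = Counter(w for t in texts for w in t.split())
    # the most frequent word gets index 0, the next index 1, and so on
    return {w: i for i, (w, _) in enumerate(counts.most_common(nb_words))}

def texts_to_indices(texts, word_index):
    # replace each in-vocabulary word by its index; drop out-of-vocabulary words
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]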
b) Converting words to word vectors
Google's word2vec tool is used to generate a word vector for each word from the training text. Note that the vocabulary produced by this tool contains an entry named </s>, which comes from converted newline/carriage-return characters; simply ignore it. When a word in the test corpus is out of vocabulary, its vector is filled with all zeros.
In this experiment the vector size is set to 48, i.e. each word becomes a 48-dimensional word vector.
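A minimal sketch of turning one document into the fixed-size input the GRU expects later (a 100x48 matrix), assuming word_vecs is a dict mapping each word to its 48-dimensional numpy vector loaded from the word2vec output (the name doc_to_matrix is mine):
import numpy as np

def doc_to_matrix(words, word_vecs, maxlen=100, dim=48):
    # rows beyond the document length, and out-of-vocabulary words, stay all-zero
    mat = np.zeros((maxlen, dim), dtype='float32')
    for i, w in enumerate(words[:maxlen]):   # keep only the first maxlen words
        if w != '</s>' and w in word_vecs:   # ignore the </s> entry
            mat[i] = word_vecs[w]
    return mat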
3. Experiments
The experiment code for the two models, the DNN and the LSTM variant GRU, is listed below.
a) DNN
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.preprocessing.text import Tokenizer
import pickle
import LoadOriData
batch_size = 32
maxlen = 100         # unused in this script; the sequence length used in the GRU experiment
max_features = 1000  # vocabulary size: only the top-1000 most frequent words are used
print("Loading data...")
X_train, Y_train = LoadOriData.Process('20ng-train-stemmed.txt', nb_words=max_features)
X_test, Y_test = LoadOriData.Process('20ng-test-stemmed.txt', nb_words=max_features)
print(len(X_train), 'train sequences')
tokenizer = Tokenizer(nb_words=max_features)
X_train = tokenizer.sequences_to_matrix(X_train, mode="binary")
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print('X_train shape:', X_train.shape)
#print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_features,), activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
f = open('20mlp_model.txt', 'w')
f.write(json_string)
f.close()
print("Train...")
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=5, show_accuracy=True)
model.save_weights('20mlp_weights.h5', overwrite=True)
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- tokenizer.sequences_to_matrix converts the sequences of word indices into binary 0/1 vectors. This experiment uses max_features = 1000, i.e. only whether each of the top-1000 words occurs is recorded, so the input layer size is 1000.
- For convenience, the outputs (the class labels) were already converted into one-hot 0/1 vectors during preprocessing, so no further processing is needed here. Keras does, however, provide a utility for this: with keras.utils.np_utils, if y_test holds integer class labels, Y_test = np_utils.to_categorical(y_test, nb_classes) yields the one-hot encoding.
- The DNN structure in this experiment is 1000*512*20 with 50% dropout. Note that the activation of the last layer is softmax, the loss function is categorical_crossentropy (cross-entropy over the predicted class), and class_mode is set to categorical.
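The custom LoadOriData module used above is not listed. Purely as an illustration of what its Process function might do, here is a rough sketch that assumes each line of the stemmed files has the form "label<TAB>text"; in the actual code the word index would presumably be built once from the training file and reused for the test file.
import numpy as np
from collections import Counter

def Process(path, nb_words=1000):
    texts, labels = [], []
    for line in open(path):
        label, text = line.strip().split('\t', 1)   # assumed "label<TAB>text" line format
        labels.append(label)
        texts.append(text.split())
    # frequency-ranked word index (see 2.2 a) and an integer index per class label
    word_index = {w: i for i, (w, _) in enumerate(
        Counter(w for t in texts for w in t).most_common(nb_words))}
    label_index = {l: i for i, l in enumerate(sorted(set(labels)))}
    X = [[word_index[w] for w in t if w in word_index] for t in texts]
    Y = np.zeros((len(labels), len(label_index)), dtype='float32')  # one-hot labels
    for i, l in enumerate(labels):
        Y[i, label_index[l]] = 1.0
    return X, Y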
b) GRU
from __future__ import absolute_import
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.preprocessing import sequence
from keras.optimizers import SGD, RMSprop, Adagrad
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
import pickle
import os
batch_size = 32
weights_file = '20lstm_weights.h5'
print("Loading data...")
f=open('train.pkl', 'r')
X_train, Y_train = pickle.load(f)
f.close()
print('X_train shape:', X_train.shape)
print('Y_train shape:', Y_train.shape)
print('Build model...')
model = Sequential()
model.add(GRU(output_dim=128, input_dim=48, activation='tanh', inner_activation='hard_sigmoid', input_length=100))  # 100 time steps of 48-dim word vectors -> 128-dim state
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
# try using different optimizers and different optimizer configs
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")
json_string = model.to_json()
print(json_string)
print("Train...")
if os.path.exists(weights_file):
    model.load_weights(weights_file)  # resume from previously saved weights, if any
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=4, show_accuracy=True)
model.save_weights(weights_file, overwrite=True)
f=open('test.pkl', 'r')
X_test, Y_test = pickle.load(f)
f.close()
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1, show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Notes:
- train.pkl and test.pkl are built from the word2vec output by truncating each document to its first 100 words: X has 48 dimensions per time step with a sequence length of 100 (out-of-vocabulary words are filled with all zeros); Y has 20 dimensions (the number of classes), with the dimension of the true class set to 1 and the rest to 0.
- model.evaluate computes the loss and the accuracy directly (the accuracy is only meaningful for classification tasks).
- The LSTM weights comprise 12 parameter arrays: U_c, U_f, U_i, U_o, W_c, W_f, W_i, W_o, b_c, b_f, b_i, b_o; their values can be inspected with eval(). The GRU weights comprise U_h, U_r, U_z, W_h, W_r, W_z, b_h, b_r, b_z (see the inspection sketch after this list).
- For the LSTM and GRU equations, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/. Note that the W matrix in that article corresponds to the concatenation of Keras's U and W matrices; the article effectively merges the two in its formulas.
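A minimal sketch of inspecting the trained GRU layer's parameters via get_weights() (the exact ordering of the nine arrays depends on the Keras version, so checking the shapes is the safest way to tell the W_*, U_* and b_* arrays apart):
gru_weights = model.layers[0].get_weights()  # list of numpy arrays for the GRU layer
for w in gru_weights:
    # input-to-hidden matrices W_* are (48, 128), recurrent matrices U_* are (128, 128),
    # biases b_* are (128,)
    print(w.shape)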
4. Results
4.1 DNN
After 5 epochs: 92% accuracy on the training set and 70% on the test set.
In [156]: run 20ng_mlp.py
Loading data...
11293 train sequences
X_train shape: (11293, 1000)
Build model...
{"layers": [{"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "input_shape": [1000], "init": "glorot_uniform", "activation": "tanh", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 512}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/5
11293/11293 [==============================] - 9s - loss: 1.4124 - acc: 0.6726
Epoch 2/5
11293/11293 [==============================] - 9s - loss: 0.6080 - acc: 0.8343
Epoch 3/5
11293/11293 [==============================] - 10s - loss: 0.4347 - acc: 0.8773
Epoch 4/5
11293/11293 [==============================] - 9s - loss: 0.3388 - acc: 0.9059
Epoch 5/5
11293/11293 [==============================] - 9s - loss: 0.2772 - acc: 0.9212
7528/7528 [==============================] - 1s
Test score: 1.12070538341
Test accuracy: 0.701248671626
4.2 GRU
Trained for 24 epochs in total (the first 20 epochs were run earlier; only the log of the last 4 epochs is shown here).
Training-set accuracy 86%, test-set accuracy 75%.
In [1]: run 20ng_lstm.py
Loading data...
X_train shape: (11293, 100, 48)
Y_train shape: (11293, 20)
Build model...
/usr/local/lib/python2.7/dist-packages/Theano-0.7.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
{"layers": [{"truncate_gradient": -1, "name": "GRU", "inner_activation": "hard_sigmoid", "output_dim": 128, "input_shape": [100, 48], "init": "glorot_uniform", "inner_init": "orthogonal", "input_dim": 48, "return_sequences": false, "activation": "tanh", "input_length": 100}, {"p": 0.5, "name": "Dropout"}, {"b_constraint": null, "name": "Dense", "activity_regularizer": null, "W_constraint": null, "init": "glorot_uniform", "activation": "softmax", "input_dim": null, "b_regularizer": null, "W_regularizer": null, "output_dim": 20}], "loss": "categorical_crossentropy", "theano_mode": null, "name": "Sequential", "class_mode": "categorical", "optimizer": {"beta_1": 0.9, "epsilon": 1e-08, "beta_2": 0.999, "lr": 0.001, "name": "Adam"}}
Train...
Epoch 1/4
11293/11293 [==============================] - 124s - loss: 0.4315 - acc: 0.8681
Epoch 2/4
56/11293 [=========================>....] - ETA: 15s - loss: 0.4299 - acc: 0.8693
Epoch 3/4
11293/11293 [==============================] - 118s - loss: 0.4081 - acc: 0.8756
Epoch 4/4
11293/11293 [==============================] - 130s - loss: 0.3950 - acc: 0.8837
7528/7528 [==============================] - 21s
Test score: 0.863923724031
Test accuracy: 0.758368756642
4.3 Summary
1. Each LSTM/GRU epoch takes roughly 11 times as long as a DNN epoch, and more epochs are needed than for the DNN, but the test-set accuracy is better than the DNN's.
2. Increasing the LSTM/GRU window length (the sequence length) improves the accuracy reached per epoch, but each epoch takes longer to run.
3. The GRU performed better than the LSTM: higher accuracy and faster training. Due to time constraints no logs were kept, but in these runs the GRU consistently beat the LSTM. For an introduction to the GRU see http://colah.github.io/posts/2015-08-Understanding-LSTMs/
5. Miscellaneous
- How to inspect the output of an intermediate layer:
Taking the DNN model as an example:
model2 = Sequential()
model2.add(Dense(512, input_shape=(max_features,), activation='tanh', weights = model.layers[0].get_weights()))
model2.compile(loss='categorical_crossentropy', optimizer='adam', class_mode = "categorical")
Then TT = model2.predict(X_test, batch_size=...) gives the output after the first layer.
- How Dropout behaves, taking a rate of 0.5 as an example:
a) During training, it randomly disables half of the neurons.
b) At prediction time, the previous layer's output is multiplied by the keep probability (1 - dropout rate), which with a rate of 0.5 is also 0.5.
- The model structure can be saved with model.to_json() and then writing the returned string to a file.
- Model weights are saved with model.save_weights('20mlp_weights.h5', overwrite=True), which writes an h5 file. Alternatively, model.get_weights() returns the weights of every layer that has weight parameters (a Dropout layer has none). For the DNN, the result is an array of length 4: array[0] and array[1] are the W and b of the 1000*512 layer, and array[2] and array[3] are the W and b of the 512*20 layer. A sketch of reloading the saved structure and weights follows below.
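Putting the last two notes together, restoring the DNN might look like this sketch (assuming keras.models.model_from_json, the counterpart of to_json(), is available in the Keras version used here):
from keras.models import model_from_json

# rebuild the architecture from the saved JSON, then load the saved h5 weights
model = model_from_json(open('20mlp_model.txt').read())
model.load_weights('20mlp_weights.h5')
# compile again before calling evaluate/predict
model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode='categorical')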