IMDB影评倾向分类 - N-Gram
catalogue
1. 数据集 2. 模型设计 3. 训练
1. 数据集
0x1: IMDB影评数据
本数据库含有来自IMDB的25,000条影评,被标记为正面/负面两种评价
from keras.datasets import imdb (X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb_full.pkl", nb_words=None, skip_top=0, maxlen=None, test_split=0.1) seed=113, start_char=1, oov_char=2, index_from=3) 1. path:如果你在本机上已有此数据集(位于'~/.keras/datasets/'+path),则载入。否则数据将下载到该目录下 2. nb_words:整数或None,要考虑的最常见的单词数,任何出现频率更低的单词将会被编码到0的位置。 3. skip_top:整数,忽略最常出现的若干单词,这些单词将会被编码为0 4. maxlen:整数,最大序列长度,任何长度大于此值的序列将被截断 5. seed:整数,用于数据重排的随机数种子 6. start_char:字符,序列的起始将以该字符标记,默认为1因为0通常用作padding 7. oov_char:字符,因nb_words或skip_top限制而cut掉的单词将被该字符代替 8. index_from:整数,真实的单词(而不是类似于start_char的特殊占位符)将从这个下标开始
返回值两个Tuple,(X_train, y_train), (X_test, y_test),其中
X_train和X_test:序列的列表,每个序列都是词下标的列表。如果指定了nb_words,则序列中可能的最大下标为nb_words-1。如果指定了maxlen,则序列的最大可能长度为maxlen y_train和y_test:为序列的标签,是一个二值list
0x2: 数据预处理
影评已被预处理为词下标构成的序列,单词的下标基于它在数据集中出现的频率标定,例如整数3所编码的词为数据集中第3常出现的词,类似下面这张图
h的词频在数据集中的频率排名第一,所以它的编码为1
X_train[0] [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32] y_train[0] 1
将语料集转化为一个词频组合的编号集
0x2: N-Gram生成
需要注意的是,kera的dataset默认采用1-Gram的分组方式,如果我们需要使用2-Gram、N-Gram等情况,需要在预处理阶段进行拼接
if ngram_range > 1: print('Adding {}-gram features'.format(ngram_range)) # Create set of unique n-gram from the training set. ngram_set = set() for input_list in X_train: for i in range(2, ngram_range + 1): set_of_ngram = create_ngram_set(input_list, ngram_value=i) ngram_set.update(set_of_ngram) print('ngram_set.pop') print(ngram_set.pop())
将1-Gram进行"两两组合",得到2-Gram,同时组合后的2-Gram需要再次进行index编号,编号的起始就从max_features开始,因为在1-Gram的时候,我们已经把0-max_features的编号都使用过了
但是要注意一个问题,2-Gram对应的组合数量和1-Gram是几何倍数增加的,对计算和内存的要求也成倍提高了
0x3: 填充序列pad_sequences
得到的数据是一个2D的数据,shape是
X_train shape: (2500, 400) X_test shape: (2500, 400)
2500是max_features,即我们只从训练语料库中选取排名前2500词频的词,400代表maxlen,我们认为对话的长度最长为400,把所有不足400的都padding到400长度
Relevant Link:
https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py
2. 模型设计
0x1: 嵌入层
model.add(Embedding(max_features,
embedding_dims,
input_length=maxlen))
输入数据的维度就是max_features,即我们选取的词频列表中排名前max_features的词频,输入维度为embedding_dims = 50
0x2: 池化层 - 池化是一种低损降维的好方法
在通过卷积获得了特征(features)之后,下一步我们希望利用这些特征去做分类。理论上讲,人们可以把所有解析出来的特征关联到一个分类器,例如softmax分类器,但计算量非常大。例如:对于一个96X96像素的图像,假设我们已经通过8X8个输入学习得到了400个特征。而每一个卷积都会得到一个(96 − 8 + 1) * (96 − 8 + 1) = 7921的结果集,由于已经得到了400个特征,所以对于每个样例(example)结果集的大小就将达到892 * 400 = 3,168,400 个特征。这样学习一个拥有超过3百万特征的输入的分类器是相当不明智的,并且极易出现过度拟合(over-fitting).
所以就有了pooling这个方法,本质上是把特征维度区域的一部分求个均值或者最大值,用来代表这部分区域。如果是求均值就是mean pooling,求最大值就是max pooling
之所以这样做,拿图像识别举例子,是因为:我们之所以决定使用卷积后的特征是因为图像具有一种“静态性”的属性,这也就意味着在一个图像区域有用的特征极有可能在另一个区域同样适用。因此,为了描述大的图像,一个很自然的想法就是对不同位置的特征进行聚合统计。这个均值或者最大值就是一种聚合统计的方法。
另外,如果人们选择图像中的连续范围作为池化区域,并且只是池化相同(重复)的隐藏单元产生的特征,那么,这些池化单元就具有平移不变性(translation invariant)。这就意味着即使图像经历了一个小的平移之后,依然会产生相同的(池化的)特征
model.add(GlobalAveragePooling1D())
Relevant Link:
https://keras-cn.readthedocs.io/en/latest/layers/embedding_layer/ http://www.cnblogs.com/bzjia-blog/p/3415790.html https://keras-cn.readthedocs.io/en/latest/layers/pooling_layer/
3. 训练
'''This example demonstrates the use of fasttext for text classification Based on Joulin et al's paper: Bags of Tricks for Efficient Text Classification https://arxiv.org/abs/1607.01759 Results on IMDB datasets with uni and bi-gram embeddings: Uni-gram: 0.8813 test accuracy after 5 epochs. 8s/epoch on i7 cpu. Bi-gram : 0.9056 test accuracy after 5 epochs. 2s/epoch on GTX 980M gpu. ''' from __future__ import print_function import numpy as np np.random.seed(1337) # for reproducibility from keras.preprocessing import sequence from keras.models import Sequential from keras.layers import Dense from keras.layers import Embedding from keras.layers import GlobalAveragePooling1D from keras.datasets import imdb def create_ngram_set(input_list, ngram_value=2): """ Extract a set of n-grams from a list of integers. >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=2) {(4, 9), (4, 1), (1, 4), (9, 4)} >>> create_ngram_set([1, 4, 9, 4, 1, 4], ngram_value=3) [(1, 4, 9), (4, 9, 4), (9, 4, 1), (4, 1, 4)] """ return set(zip(*[input_list[i:] for i in range(ngram_value)])) def add_ngram(sequences, token_indice, ngram_range=2): """ Augment the input list of list (sequences) by appending n-grams values. Example: adding bi-gram >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]] >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017} >>> add_ngram(sequences, token_indice, ngram_range=2) [[1, 3, 4, 5, 1337, 2017], [1, 3, 7, 9, 2, 1337, 42]] Example: adding tri-gram >>> sequences = [[1, 3, 4, 5], [1, 3, 7, 9, 2]] >>> token_indice = {(1, 3): 1337, (9, 2): 42, (4, 5): 2017, (7, 9, 2): 2018} >>> add_ngram(sequences, token_indice, ngram_range=3) [[1, 3, 4, 5, 1337], [1, 3, 7, 9, 2, 1337, 2018]] """ new_sequences = [] for input_list in sequences: new_list = input_list[:] for i in range(len(new_list) - ngram_range + 1): for ngram_value in range(2, ngram_range + 1): ngram = tuple(new_list[i:i + ngram_value]) if ngram in token_indice: new_list.append(token_indice[ngram]) new_sequences.append(new_list) return new_sequences # Set parameters: # ngram_range = 2 will add bi-grams features ngram_range = 2 max_features = 5000 maxlen = 400 batch_size = 32 embedding_dims = 50 nb_epoch = 120 print('Loading data...') (X_train, y_train), (X_test, y_test) = imdb.load_data( path="/home/zhenghan/keras/imdb_full.pkl", nb_words=max_features ) # truncate the dataset truncate_rate = 0.4 X_train = X_train[:int(len(X_train) * truncate_rate)] y_train = y_train[:int(len(y_train) * truncate_rate)] X_test = X_test[:int(len(X_test) * truncate_rate)] y_test = y_test[:int(len(y_test) * truncate_rate)] print("X_train[0]") print(X_train[0]) print('y_train[0]') print(y_train[0]) print(len(X_train), 'train sequences') print(len(X_test), 'test sequences') print('Average train sequence length: {}'.format(np.mean(list(map(len, X_train)), dtype=int))) print('Average test sequence length: {}'.format(np.mean(list(map(len, X_test)), dtype=int))) if ngram_range > 1: print('Adding {}-gram features'.format(ngram_range)) # Create set of unique n-gram from the training set. ngram_set = set() for input_list in X_train: for i in range(2, ngram_range + 1): set_of_ngram = create_ngram_set(input_list, ngram_value=i) ngram_set.update(set_of_ngram) print('ngram_set.pop') print(ngram_set.pop()) # Dictionary mapping n-gram token to a unique integer. # Integer values are greater than max_features in order # to avoid collision with existing features. start_index = max_features + 1 token_indice = {v: k + start_index for k, v in enumerate(ngram_set)} indice_token = {token_indice[k]: k for k in token_indice} # max_features is the highest integer that could be found in the dataset. max_features = np.max(list(indice_token.keys())) + 1 print('max_features: ') print(max_features) # Augmenting X_train and X_test with n-grams features X_train = add_ngram(X_train, token_indice, ngram_range) X_test = add_ngram(X_test, token_indice, ngram_range) print("X_train[0]") print(X_train[0]) print('X_test[0]') print(X_test[0]) print('Average train sequence length: {}'.format(np.mean(list(map(len, X_train)), dtype=int))) print('Average test sequence length: {}'.format(np.mean(list(map(len, X_test)), dtype=int))) print('Pad sequences (samples x time)') X_train = sequence.pad_sequences(X_train, maxlen=maxlen) X_test = sequence.pad_sequences(X_test, maxlen=maxlen) print('X_train shape:', X_train.shape) print('X_test shape:', X_test.shape) print('Build model...') model = Sequential() # we start off with an efficient embedding layer which maps # our vocab indices into embedding_dims dimensions model.add(Embedding(max_features, embedding_dims, input_length=maxlen)) # we add a GlobalAveragePooling1D, which will average the embeddings # of all words in the document model.add(GlobalAveragePooling1D()) # We project onto a single unit output layer, and squash it with a sigmoid: model.add(Dense(1, activation='sigmoid')) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch, validation_data=(X_test, y_test))
Relevant Link:
https://keras-cn.readthedocs.io/en/latest/other/datasets/ https://keras-cn.readthedocs.io/en/latest/preprocessing/sequence/
Copyright (c) 2017 LittleHann All rights reserved