006-深度学习与NLP简单应用

Auto-Encoder

如果原始图片输入后经过神经网络压缩成中间状态（编码过程Encoder），再由中间状态解码出的图片与原始输入差别很小（D解码过程ecoder），那么这个中间状态的东西，就可以用来表示原始的输入。

原先打算用AE来做神经网络中的W，但是发现效果不好，然后神经网络使用batch 的方法来平滑损失函数曲线，然后使用神经网络的“跳层”方法优化神经网络。

所以AE用处最多的就是降维。

有一个问题，农场主假设，如果一群鸡每天10点喂食，那么鸡中比较聪明的鸡就会认为每天10点钟有食物是一种自然规律，这个鸡认为的这种自然规律在机器学习中叫做局部最优解，也就是过拟合。

同理，人类在认知世界的过程中，无法开启上帝之眼，无法跳出三维而完全的认识三维世界。　　

AE实现过程：

from keras.layers import Input, Dense
from keras.models import Model
from sklearn.cluster import KMeans


class ASCIIAutoencoder():
    """基于字符的Autoencoder."""

    def __init__(self, sen_len = 512, encoding_dim = 32, epoch = 50, val_ratio = 0.3):
        """
        Init.
        :param sen_len: 把sentences pad成相同的⻓长度
        :param encoding_dim: 压缩后的维度dim
        :param epoch: 要跑多少epoch
        :param kmeanmodel: 简单的KNN clustering模型
        """
        self.sen_len = sen_len
        self.encoding_dim = encoding_dim
        self.autoencoder = None
        self.encoder = None
        self.kmeanmodel = KMeans(n_clusters = 2)
        self.epoch = epoch

    def fit(self, x):
        """
        模型构建。
        :param x: input text
        """
        # 把所有的trainset都搞成同⼀一个size，并把每⼀一个字符都换成ascii码
        x_train = self.preprocess(x, length = self.sen_len)
        # 然后给input预留留好位置
        input_text = Input(shape = (self.sen_len,))
        # "encoded" 每⼀一经过⼀一层，都被刷新成⼩小⼀一点的“压缩后表达式”
        encoded = Dense(1024, activation = 'tanh')(input_text)
        encoded = Dense(512, activation = 'tanh')(encoded)
        encoded = Dense(128, activation = 'tanh')(encoded)
        encoded = Dense(self.encoding_dim, activation = 'tanh')(encoded)
        # "decoded" 就是把刚刚压缩完的东⻄西，给反过来还原成input_text
        decoded = Dense(128, activation = 'tanh')(encoded)
        decoded = Dense(512, activation = 'tanh')(decoded)
        decoded = Dense(1024, activation = 'tanh')(decoded)
        decoded = Dense(self.sen_len, activation = 'sigmoid')(decoded)
        # 整个从⼤大到⼩小再到⼤大的model，叫 autoencoder
        self.autoencoder = Model(input = input_text, output = decoded)
        # 那么 只从⼤大到⼩小（也就是⼀一半的model）就叫 encoder
        self.encoder = Model(input = input_text, output = encoded)
        # 同理理，我们接下来搞⼀一个decoder出来，也就是从⼩小到⼤大的model
        # 来，首先encoded的input size给预留留好
        encoded_input = Input(shape = (1024,))
        # autoencoder的最后⼀一层，就应该是decoder的第⼀一层
        decoder_layer = self.autoencoder.layers[-1]
        # 然后我们从头到尾连起来，就是⼀一个decoder了了！
        decoder = Model(input = encoded_input, output = decoder_layer(encoded_input))
        # compile
        self.autoencoder.compile(optimizer = 'adam', loss = 'mse')
        # 跑起来
        self.autoencoder.fit(x_train, x_train,
                             nb_epoch = self.epoch,
                             batch_size = 1000,
                             shuffle = True,
                             )
        # 这⼀一部分是⾃自⼰己拿⾃自⼰己train⼀一下KNN，⼀一件简单的基于距离的分类器器
        x_train = self.encoder.predict(x_train)
        self.kmeanmodel.fit(x_train)

    def predict(self, x):
        """
        做预测。
        :param x: input text
        :return: predictions
        """
        # 同理理，第⼀一步 把来的 都给搞成ASCII化，并且⻓长度相同
        x_test = self.preprocess(x, length = self.sen_len)
        # 然后⽤用encoder把test集给压缩
        x_test = self.encoder.predict(x_test)
        # KNN给分类出来
        preds = self.kmeanmodel.predict(x_test)
        return preds

    def preprocess(self, s_list, length = 256):
        ...

CNN4Text

如何迁移到文字处理理？

1. 把文字表示成图片

也可以把句子做成一维的：

CNN假设：

RNN假设：

边界处理理：
Narrow vs Wide

Stride size:
步伐大小

# 有两种⽅方法可以做CNN for text
# 每种⽅方法⾥里里⾯面，也有各种不不同的玩法思路路
# 效果其实基本都差不不多，
# 我这⾥里里讲2种最普遍的：
# 1. ⼀一维的vector [...] + ⼀一维的filter [...]
# 这种⽅方法有⼀一定的信息损失（因为你average了了向量量），但是速度快，效果也没太差。
# 2. 通过w2v或其他⽅方法，做成⼆二维matrix，当做图⽚片来做。
# 这是⽐比较“讲道理理”的解决⽅方案，但是就是慢。。。放AWS上太烧钱了了。。。
# 1. 1D CNN for Text
# ⼀一个IMDB电影评论的例例⼦子
import numpy as np

np.random.seed(1337)  # for reproducibility
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Embedding
from keras.layers import Convolution1D, MaxPooling1D
from keras.datasets import imdb

# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 2
# ⾃自带的数据集，load IMDB data
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words = max_features)
# 这个数据集是已经搞好了了的BoW，⻓长这样：
# [123, 2, 0, 45, 32, 1212, 344, 4, ... ]
# 简单的把他们搞成相同⻓长度，不不够的补0，太多的砍掉
X_train = sequence.pad_sequences(X_train, maxlen = maxlen)
X_test = sequence.pad_sequences(X_test, maxlen = maxlen)
# 这⾥里里我们可以换成word2vec的vector，他们就是天然的相同⻓长度了了
# 留留个作业，⼤大家可以试试
# 初始化我们的sequential model (指的是线性排列列的layer)
model = Sequential()
# 亮点来了了，这⾥里里你需要⼀一个Embedding Layer来把你的input word indexes
# 变成tensor vectors： ⽐比如 如果dim是3， 那么：
# [[2],[123], ...] --> [[0.1， 0.4， 0.21]， [0.2, 0.4, 0.13], ... ]
# 其实很像word2vec的结果，只不不过这个Embedding没有什什么太⼤大意义，只是向量量化了了input的int
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length = maxlen,
                    dropout = 0.2))
# 这⼀一步对于直接⽤用BoW（⽐比如这个IMDB的数据集）很⽅方便便，但是对我们⾃自⼰己的word vector，
# 就不不友好了了，可以选择跳过它
# 现在可以添加⼀一个Conv layer了了
model.add(Convolution1D(nb_filter = nb_filter, filter_length = filter_length, border_mode = 'valid',
                        activation = 'relu', subsample_length = 1))
# 后⾯面跟⼀一个MaxPooling
model.add(MaxPooling1D(pool_length = model.output_shape[1]))
# Pool出来的结果 就是类似于⼀一堆的⼩小vec
# 把他们粗暴暴的flatten⼀一下（就是横着 连成⼀一起）
model.add(Flatten())
# 接下来就是简单的MLP了了
# 在Keras⾥里里，普通的layer⽤用Dense表示
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))
# 最终层
model.add(Dense(1))
model.add(Activation('sigmoid'))
# 如果⾯面对的是时间序列列的话，
# 这⾥里里也是可以把layers都换成LSTM
# compile
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
# 跑起来
model.fit(X_train, y_train,
          batch_size = batch_size,
          nb_epoch = nb_epoch,
          validation_data = (X_test, y_test))
# 这⾥里里有个fault啊，不不可以拿testset做validation的
# 这只是简单的做个示例例，为了了跑起来⽅方便便
# 2. 2D CNN
# 我们的input 应该是⼀一个list of M*N matrixs
# 但是注意⼀一下，我们需要稍微reshape⼀一下，把每个matrix⽤用⼀一个⼀一维的[]包起来
# 这就等于 我们把input变成 list of lists, 每个list包含⼀一个M*N Matrix
x_train = X_train.reshape(X_train.shape[0], 1,
                          X_train.shape[1], X_train.shape[2])
# cast⼀一下type，以防Numpy冲突
x_train = x_train.astype('float32')
# 接着，还是⼀一样：
model = Sequential()
# n_filter, ⼀一共⼏几个filter
# n_conv，每个filter的size
model.add(Convolution2D(n_filter, n_conv, n_conv,
                        border_mode = border_mode,
                        input_shape = (1, x_axis,
                                       y_axis)))
model.add(Activation('relu'))
model.add(Convolution2D(n_filter, n_conv, n_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size = (n_pool, n_pool)))
model.add(Dropout(0.25))
model.add(Flatten())

案例

从每日新闻中预测金融市场变化

数据获取
/r/worldnews

DJIA（道琼斯指数）

RedditNews（经济新闻）

Combine
Date, Text, Label

W2V模型的Pretrain:
1.GoogleNews.bin (https://code.google.com/archive/p/word2vec/)
2.RedditComments (https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/)
3. 用我提供的dataset里的新闻文本直接当场train
4. 我预先训练好的 reddit w2v model

普通神经网络：

RNN：

RNN的目的是让有sequential关系的信息得到考虑。
什么是sequential关系？
就是信息在时间上的前后关系。
相比于普通神经网络：

每个时间点中的S计算

这个神经元最终的输出，
基于最后一个S

简单来说，对于t=5来说，其实就相当于把一个神经元拉伸成五个
换句句话说，S就是我们所说的记忆（因为把t从1-5的信息都记录下来了）

由前文可见，RNN可以带上记忆。
假设，一个『生成下一个单词』的例子：
『这顿饭真好』——>『吃』
很明显，我们只要前5个字就能猜到下一个字是啥了了
However，
如果我问你，『穿山甲说了什么？』
你能回答嘛？
(credit to 暴走漫画)

LSTM

　　　　　　　　　　　　　　　　　　　　　　　　RNN

　　　　　　　　　　　　　　　　　　　　　　　　LSTM

LSTM中最重要的就是这个Cell State，它一路向下，贯穿这个时间线，代表了记忆的纽带。它会被XOR和AND运算符搞一搞，来更更新记忆

而控制信息的增加和减少的，就是靠这些阀门：Gate

阀门嘛，就是输出一个1于0之间的值：

1 代表，把这一趟的信息都记着

0 代表，这一趟的信息可以忘记了

下面我们来模拟一遍信息在LSTM里跑跑~

第一步：忘记门
来决定我们该忘记什么信息

它把上一次的状态ht-1和这一次的输入xt相比较
通过gate输出一个0到1的值（就像是个activation function一样），
1 代表：给我记着！
0 代表：快快忘记！

第二步：记忆门
哪些该记住
这个门比较复杂，分两步：
第一步，用sigmoid决定什什么信息需要被我们更新（忘记旧的）
第二部，用Tanh造一个新的Cell State（更更新后的cell state）

第三步：更新门
把老cell state更新为新cell state
用XOR和AND这样的门来更新我们的cell state：

第四步：输出门
由记忆来决定输出什什么值
我们的Cell State已经被更更新，
于是我们通过这个记忆纽带，来决定我们的输出：
（这里的Ot类似于我们刚刚RNN里里直接一步跑出来的output）

案例

题目原型：What’s Next？

可以用在不不同的维度上：

维度1：下一个字母是什么？

维度2：下一个单词是什么？

维度3：下一个句子是什么？

维度N：下一个图片/音符/….是什么？

用RNN做文本生成

举个小小的例子，来看看LSTM是怎么玩的

我们这里用温斯顿丘吉尔的人物传记作为我们的学习语料。

第一步，一样，先导入各种库

import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5105)
/usr/local/lib/python3.5/dist-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)

接下来，我们把文本读入

raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()

既然我们是以每个字母为层级，字母总共才26个，所以我们可以很方便的用One-Hot来编码出所有的字母（当然，可能还有些标点符号和其他noise）

chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

　　我们看到，全部的chars：

chars

['\n',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '‘',
 '’',
 '“',
 '”',
 '\ufeff']

一共有：

len(chars)

61
同时，我们的原文本一共有:

len(raw_text)

我们这里简单的文本预测就是，给了前置的字母以后，下一个字母是谁？

比如，Winsto, 给出 n Britai 给出 n

构造训练测试集

我们需要把我们的raw text变成可以用来训练的x,y:

x 是前置字母们 y 是后一个字母

seq_length = 100
x = []
y = []
for i in range(0, len(raw_text) - seq_length):
    given = raw_text[i:i + seq_length]
    predict = raw_text[i + seq_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])

　　我们可以看看我们做好的数据集的长相：

print(x[:3])
print(y[:3])

[[60, 45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44], [45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35], [47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35, 1]]
[35, 1, 30]

此刻，楼上这些表达方式，类似就是一个词袋，或者说 index。

接下来我们做两件事：

我们已经有了一个input的数字表达（index），我们要把它变成LSTM需要的数组格式： [样本数，时间步伐，特征]
第二，对于output，我们在Word2Vec里学过，用one-hot做output的预测可以给我们更好的效果，相对于直接预测一个准确的y数值的话。

n_patterns = len(x)
n_vocab = len(chars)

# 把x变成LSTM需要的样子
x = numpy.reshape(x, (n_patterns, seq_length, 1))
# 简单normal到0-1之间
x = x / float(n_vocab)
# output变成one-hot
y = np_utils.to_categorical(y)

print(x[11])
print(y[11])

[[ 0.80327869]
 [ 0.55737705]
 [ 0.70491803]
 [ 0.50819672]
 [ 0.55737705]
 [ 0.7704918 ]
 [ 0.59016393]
 [ 0.93442623]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.7704918 ]
 [ 0.55737705]
 [ 0.49180328]
 [ 0.67213115]
 [ 0.01639344]
 [ 0.78688525]
 [ 0.72131148]
 [ 0.67213115]
 [ 0.54098361]
 [ 0.62295082]
 [ 0.55737705]
 [ 0.7704918 ]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.72131148]
 [ 0.57377049]
 [ 0.01639344]
 [ 0.57377049]
 [ 0.72131148]
 [ 0.7704918 ]
 [ 0.80327869]
 [ 0.81967213]
 [ 0.70491803]
 [ 0.55737705]
 [ 0.14754098]
 [ 0.01639344]
 [ 0.50819672]
 [ 0.8852459 ]
 [ 0.01639344]
 [ 0.7704918 ]
 [ 0.62295082]
 [ 0.52459016]
 [ 0.60655738]
 [ 0.49180328]
 [ 0.7704918 ]
 [ 0.54098361]
 [ 0.01639344]
 [ 0.60655738]
 [ 0.49180328]
 [ 0.7704918 ]
 [ 0.54098361]
 [ 0.62295082]
 [ 0.70491803]
 [ 0.59016393]
 [ 0.01639344]
 [ 0.54098361]
 [ 0.49180328]
 [ 0.83606557]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.        ]
 [ 0.        ]
 [ 0.80327869]
 [ 0.60655738]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.55737705]
 [ 0.50819672]
 [ 0.72131148]
 [ 0.72131148]
 [ 0.6557377 ]
 [ 0.01639344]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.57377049]
 [ 0.72131148]
 [ 0.7704918 ]
 [ 0.01639344]
 [ 0.80327869]
 [ 0.60655738]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.81967213]
 [ 0.78688525]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.72131148]
 [ 0.57377049]
 [ 0.01639344]
 [ 0.49180328]
 [ 0.70491803]
 [ 0.8852459 ]
 [ 0.72131148]
 [ 0.70491803]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.49180328]
 [ 0.70491803]]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.]

模型建造

LSTM模型构建

model = Sequential()
model.add(LSTM(256, input_shape=(x.shape[1], x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

跑模型

model.fit(x, y, nb_epoch=50, batch_size=4096)

Epoch 1/50
276730/276730 [==============================] - 197s - loss: 3.1120   
Epoch 2/50
276730/276730 [==============================] - 197s - loss: 3.0227   
Epoch 3/50
276730/276730 [==============================] - 197s - loss: 2.9910   
Epoch 4/50
276730/276730 [==============================] - 197s - loss: 2.9337   
Epoch 5/50
276730/276730 [==============================] - 197s - loss: 2.8971   
Epoch 6/50
276730/276730 [==============================] - 197s - loss: 2.8784   
Epoch 7/50
276730/276730 [==============================] - 197s - loss: 2.8640   
Epoch 8/50
276730/276730 [==============================] - 197s - loss: 2.8516   
Epoch 9/50
276730/276730 [==============================] - 197s - loss: 2.8384   
Epoch 10/50
276730/276730 [==============================] - 197s - loss: 2.8254   
Epoch 11/50
276730/276730 [==============================] - 197s - loss: 2.8133   
Epoch 12/50
276730/276730 [==============================] - 197s - loss: 2.8032   
Epoch 13/50
276730/276730 [==============================] - 197s - loss: 2.7913   
Epoch 14/50
276730/276730 [==============================] - 197s - loss: 2.7831   
Epoch 15/50
276730/276730 [==============================] - 197s - loss: 2.7744   
Epoch 16/50
276730/276730 [==============================] - 197s - loss: 2.7672   
Epoch 17/50
276730/276730 [==============================] - 197s - loss: 2.7601   
Epoch 18/50
276730/276730 [==============================] - 197s - loss: 2.7540   
Epoch 19/50
276730/276730 [==============================] - 197s - loss: 2.7477   
Epoch 20/50
276730/276730 [==============================] - 197s - loss: 2.7418   
Epoch 21/50
276730/276730 [==============================] - 197s - loss: 2.7360   
Epoch 22/50
276730/276730 [==============================] - 197s - loss: 2.7296   
Epoch 23/50
276730/276730 [==============================] - 197s - loss: 2.7238   
Epoch 24/50
276730/276730 [==============================] - 197s - loss: 2.7180   
Epoch 25/50
276730/276730 [==============================] - 197s - loss: 2.7113   
Epoch 26/50
276730/276730 [==============================] - 197s - loss: 2.7055   
Epoch 27/50
276730/276730 [==============================] - 197s - loss: 2.7000   
Epoch 28/50
276730/276730 [==============================] - 197s - loss: 2.6934   
Epoch 29/50
276730/276730 [==============================] - 197s - loss: 2.6859   
Epoch 30/50
276730/276730 [==============================] - 197s - loss: 2.6800   
Epoch 31/50
276730/276730 [==============================] - 197s - loss: 2.6741   
Epoch 32/50
276730/276730 [==============================] - 197s - loss: 2.6669   
Epoch 33/50
276730/276730 [==============================] - 197s - loss: 2.6593   
Epoch 34/50
276730/276730 [==============================] - 197s - loss: 2.6529   
Epoch 35/50
276730/276730 [==============================] - 197s - loss: 2.6461   
Epoch 36/50
276730/276730 [==============================] - 197s - loss: 2.6385   
Epoch 37/50
276730/276730 [==============================] - 197s - loss: 2.6320   
Epoch 38/50
276730/276730 [==============================] - 197s - loss: 2.6249   
Epoch 39/50
276730/276730 [==============================] - 197s - loss: 2.6187   
Epoch 40/50
276730/276730 [==============================] - 197s - loss: 2.6110   
Epoch 41/50
276730/276730 [==============================] - 192s - loss: 2.6039   
Epoch 42/50
276730/276730 [==============================] - 141s - loss: 2.5969   
Epoch 43/50
276730/276730 [==============================] - 140s - loss: 2.5909   
Epoch 44/50
276730/276730 [==============================] - 140s - loss: 2.5843   
Epoch 45/50
276730/276730 [==============================] - 140s - loss: 2.5763   
Epoch 46/50
276730/276730 [==============================] - 140s - loss: 2.5697   
Epoch 47/50
276730/276730 [==============================] - 141s - loss: 2.5635   
Epoch 48/50
276730/276730 [==============================] - 140s - loss: 2.5575   
Epoch 49/50
276730/276730 [==============================] - 140s - loss: 2.5496   
Epoch 50/50
276730/276730 [==============================] - 140s - loss: 2.5451

Out[11]:

<keras.callbacks.History at 0x7fb6121b6e48>

我们来写个程序，看看我们训练出来的LSTM的效果：

def predict_next(input_array):
    x = numpy.reshape(input_array, (1, seq_length, 1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    res = []
    for c in raw_input[(len(raw_input)-seq_length):]:
        res.append(char_to_int[c])
    return res

def y_to_char(y):
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c

好，写成一个大程序：

def generate_article(init, rounds=200):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string

init = 'His object in coming to New York was to engage officers for that service. He came at an opportune moment'
article = generate_article(init)
print(article)

his object in coming to new york was to engage officers for that service. he came at an opportune moment th the toote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soo

用RNN做文本生成

举个小小的例子，来看看LSTM是怎么玩的

我们这里不再用char级别，我们用word级别来做。

第一步，一样，先导入各种库

import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec

　　接下来，我们把文本读入

raw_text = ''
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text += open("../input/"+file, errors='ignore').read() + '\n\n'
# raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')        
sents = sentensor.tokenize(raw_text)
corpus = []
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))

print(len(corpus))
print(corpus[:3])

91007
[['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'great', 'expectations', ',', 'by', 'charles', 'dickens', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'great', 'expectations', 'author', ':', 'charles', 'dickens', 'posting', 'date', ':', 'august', '20', ',', '2008', '[', 'ebook', '#', '1400', ']', 'release', 'date', ':', 'july', ',', '1998', 'last', 'updated', ':', 'september', '25', ',', '2016', 'language', ':', 'english', 'character', 'set', 'encoding', ':', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'great', 'expectations', '***', 'produced', 'by', 'an', 'anonymous', 'volunteer', 'great', 'expectations', '[', '1867', 'edition', ']', 'by', 'charles', 'dickens', '[', 'project', 'gutenberg', 'editor’s', 'note', ':', 'there', 'is', 'also', 'another', 'version', 'of', 'this', 'work', 'etext98/grexp10.txt', 'scanned', 'from', 'a', 'different', 'edition', ']', 'chapter', 'i', 'my', 'father’s', 'family', 'name', 'being', 'pirrip', ',', 'and', 'my', 'christian', 'name', 'philip', ',', 'my', 'infant', 'tongue', 'could', 'make', 'of', 'both', 'names', 'nothing', 'longer', 'or', 'more', 'explicit', 'than', 'pip', '.'], ['so', ',', 'i', 'called', 'myself', 'pip', ',', 'and', 'came', 'to', 'be', 'called', 'pip', '.']]

好，w2v乱炖：

w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)

　　可以了

w2v_model['office']

array([-0.01398709,  0.15975526,  0.03589381, -0.4449192 ,  0.365403  ,
        0.13376504,  0.78731823,  0.01640314, -0.29723561, -0.21117583,
        0.13451998, -0.65348488,  0.06038611, -0.02000343,  0.05698346,
        0.68013376,  0.19010596,  0.56921762,  0.66904438, -0.08069923,
       -0.30662233,  0.26082459, -0.74816126, -0.41383636, -0.56303871,
       -0.10834043, -0.10635001, -0.7193433 ,  0.29722607, -0.83104628,
        1.11914253, -0.34119046, -0.39490014, -0.34709939, -0.00583572,
        0.17824887,  0.43295503,  0.11827419, -0.28707108, -0.02838829,
        0.02565269,  0.10328653, -0.19100265, -0.24102989,  0.23023468,
        0.51493132,  0.34759828,  0.05510307,  0.20583512, -0.17160387,
       -0.10351282,  0.19884749, -0.03935663, -0.04055062,  0.38888735,
       -0.02003323, -0.16577065, -0.15858875,  0.45083243, -0.09268586,
       -0.91098118,  0.16775337,  0.3432925 ,  0.2103184 , -0.42439541,
        0.26097715, -0.10714807,  0.2415273 ,  0.2352251 , -0.21662289,
       -0.13343927,  0.11787982, -0.31010333,  0.21146733, -0.11726214,
       -0.65574747,  0.04007725, -0.12032496, -0.03468512,  0.11063002,
        0.33530036, -0.64098376,  0.34013858, -0.08341357, -0.54826909,
        0.0723564 , -0.05169795, -0.19633259,  0.08620321,  0.05993884,
       -0.14693044, -0.40531522, -0.07695422,  0.2279872 , -0.12342903,
       -0.1919964 , -0.09589464,  0.4433476 ,  0.38304719,  1.0319351 ,
        0.82628119,  0.3677327 ,  0.07600326,  0.08538571, -0.44261214,
       -0.10997667, -0.03823839,  0.40593523,  0.32665277, -0.67680383,
        0.32504487,  0.4009226 ,  0.23463745, -0.21442334,  0.42727917,
        0.19593567, -0.10731711, -0.01080817, -0.14738144,  0.15710345,
       -0.01099576,  0.35833639,  0.16394758, -0.10431164, -0.28202233,
        0.24488974,  0.69327635, -0.29230621], dtype=float32)

接下来，其实我们还是以之前的方式来处理我们的training data，把源数据变成一个长长的x，好让LSTM学会predict下一个单词：

raw_input = [item for sublist in corpus for item in sublist]
len(raw_input)

raw_input[12]

'ebook'

text_stream = []
vocab = w2v_model.vocab
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
len(text_stream)

我们这里的文本预测就是，给了前面的单词以后，下一个单词是谁？

比如，hello from the other, 给出 side

构造训练测试集

我们需要把我们的raw text变成可以用来训练的x,y:

x 是前置字母们 y 是后一个字母

seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):

    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model[word] for word in given]))
    y.append(w2v_model[predict])

我们可以看看我们做好的数据集的长相：

print(x[10])
print(y[10])

[[-0.02218935  0.04861801 -0.03001036 ...,  0.07096259  0.16345282
  -0.18007144]
 [ 0.1663752   0.67981642  0.36581406 ...,  1.03355932  0.94110376
  -1.02763569]
 [-0.12611888  0.75773817  0.00454156 ...,  0.80544478  2.77890372
  -1.00110698]
 ..., 
 [ 0.34167829 -0.28152692 -0.12020591 ...,  0.19967555  1.65415502
  -1.97690392]
 [-0.66742641  0.82389861 -1.22558379 ...,  0.12269551  0.30856156
   0.29964617]
 [-0.17075984  0.0066567  -0.3894183  ...,  0.23729582  0.41993639
  -0.12582727]]
[ 0.18125793 -1.72401989 -0.13503326 -0.42429626  1.40763748 -2.16775346
  2.26685596 -2.03301549  0.42729807 -0.84830129  0.56945151  0.87243706
  3.01571465 -0.38155749 -0.99618471  1.1960727   1.93537641  0.81187075
 -0.83017075 -3.18952608  0.48388934 -0.03766865 -1.68608069 -1.84907544
 -0.95259917  0.49039507 -0.40943271  0.12804921  1.35876858  0.72395176
  1.43591952 -0.41952157  0.38778016 -0.75301784 -2.5016799  -0.85931653
 -1.39363682  0.42932403  1.77297652  0.41443667 -1.30974782 -0.08950856
 -0.15183811 -1.59824061 -1.58920395  1.03765178  2.07559252  2.79692245
  1.11855054 -0.25542653 -1.04980111 -0.86929852 -1.26279402 -1.14124119
 -1.04608357  1.97869778 -2.23650813 -2.18115139 -0.26534671  0.39432198
 -0.06398458 -1.02308178  1.43372631 -0.02581184 -0.96472031 -3.08931994
 -0.67289352  1.06766248 -1.95796657  1.40857184  0.61604798 -0.50270212
 -2.33530831  0.45953822  0.37867084 -0.56957626 -1.90680516 -0.57678169
  0.50550407 -0.30320352  0.19682285  1.88185465 -1.40448165 -0.43952951
  1.95433044  2.07346153  0.22390689 -0.95107335 -0.24579825 -0.21493609
  0.66570002 -0.59126669 -1.4761591   0.86431485  0.36701021  0.12569368
  1.65063572  2.048352    1.81440067 -1.36734581  2.41072559  1.30975604
 -0.36556485 -0.89859813  1.28804696 -2.75488496  1.5667206  -1.75327337
  0.60426879  1.77851915 -0.32698369  0.55594021  2.01069188 -0.52870172
 -0.39022744 -1.1704396   1.28902853 -0.89315164  1.41299319  0.43392688
 -2.52578211 -1.13480854 -1.05396986 -0.85470092  0.6618616   1.23047733
 -0.28597715 -2.35096407]

print(len(x))
print(len(y))
print(len(x[12]))
print(len(x[12][0]))
print(len(y[12]))

x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1,128))

接下来我们做两件事：

我们已经有了一个input的数字表达（w2v），我们要把它变成LSTM需要的数组格式： [样本数，时间步伐，特征]
第二，对于output，我们直接用128维的输出

模型建造

LSTM模型构建

model = Sequential()
model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')

跑模型

model.fit(x, y, nb_epoch=50, batch_size=4096)

Epoch 1/50
2058743/2058743 [==============================] - 150s - loss: 0.6839   
Epoch 2/50
2058743/2058743 [==============================] - 150s - loss: 0.6670   
Epoch 3/50
2058743/2058743 [==============================] - 150s - loss: 0.6625   
Epoch 4/50
2058743/2058743 [==============================] - 150s - loss: 0.6598   
Epoch 5/50
2058743/2058743 [==============================] - 150s - loss: 0.6577   
Epoch 6/50
2058743/2058743 [==============================] - 150s - loss: 0.6562   
Epoch 7/50
2058743/2058743 [==============================] - 150s - loss: 0.6549   
Epoch 8/50
2058743/2058743 [==============================] - 150s - loss: 0.6537   
Epoch 9/50
2058743/2058743 [==============================] - 150s - loss: 0.6527   
Epoch 10/50
2058743/2058743 [==============================] - 150s - loss: 0.6519   
Epoch 11/50
2058743/2058743 [==============================] - 150s - loss: 0.6512   
Epoch 12/50
2058743/2058743 [==============================] - 150s - loss: 0.6506   
Epoch 13/50
2058743/2058743 [==============================] - 150s - loss: 0.6500   
Epoch 14/50
2058743/2058743 [==============================] - 150s - loss: 0.6496   
Epoch 15/50
2058743/2058743 [==============================] - 150s - loss: 0.6492   
Epoch 16/50
2058743/2058743 [==============================] - 150s - loss: 0.6488   
Epoch 17/50
2058743/2058743 [==============================] - 151s - loss: 0.6485   
Epoch 18/50
2058743/2058743 [==============================] - 150s - loss: 0.6482   
Epoch 19/50
2058743/2058743 [==============================] - 150s - loss: 0.6480   
Epoch 20/50
2058743/2058743 [==============================] - 150s - loss: 0.6477   
Epoch 21/50
2058743/2058743 [==============================] - 150s - loss: 0.6475   
Epoch 22/50
2058743/2058743 [==============================] - 150s - loss: 0.6473   
Epoch 23/50
2058743/2058743 [==============================] - 150s - loss: 0.6471   
Epoch 24/50
2058743/2058743 [==============================] - 150s - loss: 0.6470   
Epoch 25/50
2058743/2058743 [==============================] - 150s - loss: 0.6468   
Epoch 26/50
2058743/2058743 [==============================] - 150s - loss: 0.6466   
Epoch 27/50
2058743/2058743 [==============================] - 150s - loss: 0.6464   
Epoch 28/50
2058743/2058743 [==============================] - 150s - loss: 0.6463   
Epoch 29/50
2058743/2058743 [==============================] - 150s - loss: 0.6462   
Epoch 30/50
2058743/2058743 [==============================] - 150s - loss: 0.6461   
Epoch 31/50
2058743/2058743 [==============================] - 150s - loss: 0.6460   
Epoch 32/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 33/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 34/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 35/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 36/50
2058743/2058743 [==============================] - 150s - loss: 0.6455   
Epoch 37/50
2058743/2058743 [==============================] - 150s - loss: 0.6454   
Epoch 38/50
2058743/2058743 [==============================] - 150s - loss: 0.6453   
Epoch 39/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 40/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 41/50
2058743/2058743 [==============================] - 150s - loss: 0.6451   
Epoch 42/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 43/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 44/50
2058743/2058743 [==============================] - 150s - loss: 0.6449   
Epoch 45/50
2058743/2058743 [==============================] - 150s - loss: 0.6448   
Epoch 46/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 47/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 48/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 49/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 50/50
2058743/2058743 [==============================] - 150s - loss: 0.6445

Out[130]:

<keras.callbacks.History at 0x7f6ed8816a58>

我们来写个程序，看看我们训练出来的LSTM的效果：

def predict_next(input_array):
    x = np.reshape(input_array, (-1,seq_length,128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream)-seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    word = w2v_model.most_similar(positive=y, topn=1)
    return word

　　好，写成一个大程序：

def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string

init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
article = generate_article(init)
print(article)

language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say

posted on 2018-11-01 23:19 医疗兵皮特儿阅读(341) 评论(0) 收藏举报

刷新页面返回顶部

006-深度学习与NLP简单应用

用RNN做文本生成

构造训练测试集

模型建造

跑模型

用RNN做文本生成

构造训练测试集

模型建造

跑模型

导航

公告