01-NLP-04-02 Text Generation with an RNN

We don't use one-hot vectors to represent the input x, because we want to use word2vec instead.

Each word is mapped to a vector, and the vectors are concatenated into a sequence: [[w1],[w2],[w3]].
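As a rough sketch of that input format (using a made-up 4-dimensional lookup table tiny_w2v purely for illustration), each word is looked up and the vectors are stacked into a (time steps, features) array:

import numpy as np

# Hypothetical toy lookup table standing in for a trained word2vec model
tiny_w2v = {
    'hello': np.array([0.1, -0.2, 0.3, 0.0]),
    'from':  np.array([0.5,  0.1, -0.4, 0.2]),
    'the':   np.array([-0.3, 0.0, 0.2, 0.1]),
}

sentence = ['hello', 'from', 'the']
sequence = np.stack([tiny_w2v[w] for w in sentence])   # shape: (3 words, 4 features)
print(sequence.shape)   # (3, 4)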

Text Generation with an RNN

Let's walk through a small example to see how an LSTM works.

This time we don't work at the character level; we work at the word level instead.

Step 1: as before, import the libraries.

In [118]:
import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec

Next, read in the text.

In [119]:
raw_text = ''
# Read in the data. Context is taken from within sentences, so we need to split the text into sentences first.
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text += open("../input/"+file, errors='ignore').read() + '\n\n'
# raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
# Split the raw text into a list of sentences
sents = sentensor.tokenize(raw_text)
corpus = []
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))  # split each sentence into a list of words
print(len(corpus))    # corpus is a 2-D list: one list of tokens per sentence
print(corpus[:3])
Output:
91007
[['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'great', 'expectations', ',', 'by', 'charles', 'dickens', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'great', 'expectations', 'author', ':', 'charles', 'dickens', 'posting', 'date', ':', 'august', '20', ',', '2008', '[', 'ebook', '#', '1400', ']', 'release', 'date', ':', 'july', ',', '1998', 'last', 'updated', ':', 'september', '25', ',', '2016', 'language', ':', 'english', 'character', 'set', 'encoding', ':', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'great', 'expectations', '***', 'produced', 'by', 'an', 'anonymous', 'volunteer', 'great', 'expectations', '[', '1867', 'edition', ']', 'by', 'charles', 'dickens', '[', 'project', 'gutenberg', 'editor’s', 'note', ':', 'there', 'is', 'also', 'another', 'version', 'of', 'this', 'work', 'etext98/grexp10.txt', 'scanned', 'from', 'a', 'different', 'edition', ']', 'chapter', 'i', 'my', 'father’s', 'family', 'name', 'being', 'pirrip', ',', 'and', 'my', 'christian', 'name', 'philip', ',', 'my', 'infant', 'tongue', 'could', 'make', 'of', 'both', 'names', 'nothing', 'longer', 'or', 'more', 'explicit', 'than', 'pip', '.'], ['so', ',', 'i', 'called', 'myself', 'pip', ',', 'and', 'came', 'to', 'be', 'called', 'pip', '.']]
 

Now train a word2vec model on the corpus:

In [120]:
w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
# The word embeddings could also come via transfer learning: just take an already well-trained word2vec model.
# Even though such a model is trained on other corpora, what we care about here is the LSTM learning the
# writing style, so a pre-trained word model would work just as well.
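If you would rather plug in a pre-trained word2vec model (the transfer-learning option mentioned in the comment above), a minimal sketch would look like the following. The file path is only a placeholder, and the loader name depends on your gensim version: on gensim < 1.0 (as used in this notebook) it lives on Word2Vec, while on newer versions it is gensim.models.KeyedVectors.load_word2vec_format.

# Placeholder path: the pre-trained GoogleNews binary has to be downloaded separately.
pretrained = Word2Vec.load_word2vec_format(
    '../input/GoogleNews-vectors-negative300.bin', binary=True)
print(pretrained['office'].shape)   # (300,) for the GoogleNews vectors

Note that with 300-dimensional vectors, the 128 used below (the LSTM input and output width) would have to change accordingly.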
In [121]:
w2v_model['office']
Out[121]:
array([-0.01398709,  0.15975526,  0.03589381, -0.4449192 ,  0.365403  ,
        0.13376504,  0.78731823,  0.01640314, -0.29723561, -0.21117583,
        0.13451998, -0.65348488,  0.06038611, -0.02000343,  0.05698346,
        0.68013376,  0.19010596,  0.56921762,  0.66904438, -0.08069923,
       -0.30662233,  0.26082459, -0.74816126, -0.41383636, -0.56303871,
       -0.10834043, -0.10635001, -0.7193433 ,  0.29722607, -0.83104628,
        1.11914253, -0.34119046, -0.39490014, -0.34709939, -0.00583572,
        0.17824887,  0.43295503,  0.11827419, -0.28707108, -0.02838829,
        0.02565269,  0.10328653, -0.19100265, -0.24102989,  0.23023468,
        0.51493132,  0.34759828,  0.05510307,  0.20583512, -0.17160387,
       -0.10351282,  0.19884749, -0.03935663, -0.04055062,  0.38888735,
       -0.02003323, -0.16577065, -0.15858875,  0.45083243, -0.09268586,
       -0.91098118,  0.16775337,  0.3432925 ,  0.2103184 , -0.42439541,
        0.26097715, -0.10714807,  0.2415273 ,  0.2352251 , -0.21662289,
       -0.13343927,  0.11787982, -0.31010333,  0.21146733, -0.11726214,
       -0.65574747,  0.04007725, -0.12032496, -0.03468512,  0.11063002,
        0.33530036, -0.64098376,  0.34013858, -0.08341357, -0.54826909,
        0.0723564 , -0.05169795, -0.19633259,  0.08620321,  0.05993884,
       -0.14693044, -0.40531522, -0.07695422,  0.2279872 , -0.12342903,
       -0.1919964 , -0.09589464,  0.4433476 ,  0.38304719,  1.0319351 ,
        0.82628119,  0.3677327 ,  0.07600326,  0.08538571, -0.44261214,
       -0.10997667, -0.03823839,  0.40593523,  0.32665277, -0.67680383,
        0.32504487,  0.4009226 ,  0.23463745, -0.21442334,  0.42727917,
        0.19593567, -0.10731711, -0.01080817, -0.14738144,  0.15710345,
       -0.01099576,  0.35833639,  0.16394758, -0.10431164, -0.28202233,
        0.24488974,  0.69327635, -0.29230621], dtype=float32)
 

Next, we process our training data the same way as before: turn the source data into one long stream x, so the LSTM can learn to predict the next word:

In [122]:
raw_input = [item for sublist in corpus for item in sublist]  # flatten the 2-D corpus into a 1-D list of tokens
len(raw_input)
Out[122]:
2115170
In [123]:
raw_input[12]
Out[123]:
'ebook'
Remove the words that are not in the word2vec vocabulary and join the rest into one stream:

In [124]:
text_stream = []
vocab = w2v_model.vocab
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
len(text_stream)
Out[124]:
2058753
 

Our text-prediction task here is: given the preceding words, what is the next word?

For example, given "hello from the other", the model should produce "side".
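To make the windowing concrete, here is a toy sketch with a made-up seq_length of 4 (the real construction over the whole corpus, with seq_length = 10, follows below):

tokens = ['hello', 'from', 'the', 'other', 'side', 'i', 'must', 'have', 'called']
seq_length_toy = 4
pairs = []
for i in range(len(tokens) - seq_length_toy):
    given = tokens[i:i + seq_length_toy]      # the preceding words
    predict = tokens[i + seq_length_toy]      # the word to predict
    pairs.append((given, predict))
print(pairs[0])   # (['hello', 'from', 'the', 'other'], 'side')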

Constructing the training data

We need to turn our raw text into x and y that can be used for training:

x is the preceding words, and y is the next word.

In [125]:
seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):

    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model[word] for word in given]))
    y.append(w2v_model[predict])
 

我们可以看看我们做好的数据集的长相:

In [126]:
print(x[10])
print(y[10])
 
[[-0.02218935  0.04861801 -0.03001036 ...,  0.07096259  0.16345282
  -0.18007144]
 [ 0.1663752   0.67981642  0.36581406 ...,  1.03355932  0.94110376
  -1.02763569]
 [-0.12611888  0.75773817  0.00454156 ...,  0.80544478  2.77890372
  -1.00110698]
 ..., 
 [ 0.34167829 -0.28152692 -0.12020591 ...,  0.19967555  1.65415502
  -1.97690392]
 [-0.66742641  0.82389861 -1.22558379 ...,  0.12269551  0.30856156
   0.29964617]
 [-0.17075984  0.0066567  -0.3894183  ...,  0.23729582  0.41993639
  -0.12582727]]
[ 0.18125793 -1.72401989 -0.13503326 -0.42429626  1.40763748 -2.16775346
  2.26685596 -2.03301549  0.42729807 -0.84830129  0.56945151  0.87243706
  3.01571465 -0.38155749 -0.99618471  1.1960727   1.93537641  0.81187075
 -0.83017075 -3.18952608  0.48388934 -0.03766865 -1.68608069 -1.84907544
 -0.95259917  0.49039507 -0.40943271  0.12804921  1.35876858  0.72395176
  1.43591952 -0.41952157  0.38778016 -0.75301784 -2.5016799  -0.85931653
 -1.39363682  0.42932403  1.77297652  0.41443667 -1.30974782 -0.08950856
 -0.15183811 -1.59824061 -1.58920395  1.03765178  2.07559252  2.79692245
  1.11855054 -0.25542653 -1.04980111 -0.86929852 -1.26279402 -1.14124119
 -1.04608357  1.97869778 -2.23650813 -2.18115139 -0.26534671  0.39432198
 -0.06398458 -1.02308178  1.43372631 -0.02581184 -0.96472031 -3.08931994
 -0.67289352  1.06766248 -1.95796657  1.40857184  0.61604798 -0.50270212
 -2.33530831  0.45953822  0.37867084 -0.56957626 -1.90680516 -0.57678169
  0.50550407 -0.30320352  0.19682285  1.88185465 -1.40448165 -0.43952951
  1.95433044  2.07346153  0.22390689 -0.95107335 -0.24579825 -0.21493609
  0.66570002 -0.59126669 -1.4761591   0.86431485  0.36701021  0.12569368
  1.65063572  2.048352    1.81440067 -1.36734581  2.41072559  1.30975604
 -0.36556485 -0.89859813  1.28804696 -2.75488496  1.5667206  -1.75327337
  0.60426879  1.77851915 -0.32698369  0.55594021  2.01069188 -0.52870172
 -0.39022744 -1.1704396   1.28902853 -0.89315164  1.41299319  0.43392688
 -2.52578211 -1.13480854 -1.05396986 -0.85470092  0.6618616   1.23047733
 -0.28597715 -2.35096407]
In [127]:
print(len(x))
print(len(y))
print(len(x[12]))
print(len(x[12][0]))
print(len(y[12]))
 
2058743
2058743
10
128
128
In [128]:
x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1,128))
 

Here we do two things:

  1. We already have a numeric (w2v) representation of the input; we reshape it into the array format the LSTM expects: [samples, time steps, features] (see the shape check below).

  2. For the output, we use the 128-dimensional vector directly (128 being the word-vector dimensionality we chose).
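A quick sanity check of the shapes after the reshape above (the sample count comes from the lengths printed earlier):

print(x.shape)   # (2058743, 10, 128) -> [samples, time steps, features]
print(y.shape)   # (2058743, 128)     -> one 128-dimensional word vector per sample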

 

Building the model

Constructing the LSTM model

In [129]:
model = Sequential()
model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))  # dropout on the input and recurrent connections (Keras 1.x API)
model.add(Dropout(0.2))
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')
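Note that the cell above uses the Keras 1.x argument names (dropout_W / dropout_U, and nb_epoch in the fit call below). On Keras 2.x the roughly equivalent model would be written as follows; this is just a sketch and is not what produced the training log below:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model2 = Sequential()
# dropout / recurrent_dropout replace the old dropout_W / dropout_U arguments
model2.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2,
                input_shape=(seq_length, 128)))
model2.add(Dropout(0.2))
model2.add(Dense(128, activation='sigmoid'))
model2.compile(loss='mse', optimizer='adam')
# and: model2.fit(x, y, epochs=50, batch_size=4096)   # epochs instead of nb_epoch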
 

Train the model

In [130]:
model.fit(x, y, nb_epoch=50, batch_size=4096)
 
Epoch 1/50
2058743/2058743 [==============================] - 150s - loss: 0.6839   
Epoch 2/50
2058743/2058743 [==============================] - 150s - loss: 0.6670   
Epoch 3/50
2058743/2058743 [==============================] - 150s - loss: 0.6625   
Epoch 4/50
2058743/2058743 [==============================] - 150s - loss: 0.6598   
Epoch 5/50
2058743/2058743 [==============================] - 150s - loss: 0.6577   
Epoch 6/50
2058743/2058743 [==============================] - 150s - loss: 0.6562   
Epoch 7/50
2058743/2058743 [==============================] - 150s - loss: 0.6549   
Epoch 8/50
2058743/2058743 [==============================] - 150s - loss: 0.6537   
Epoch 9/50
2058743/2058743 [==============================] - 150s - loss: 0.6527   
Epoch 10/50
2058743/2058743 [==============================] - 150s - loss: 0.6519   
Epoch 11/50
2058743/2058743 [==============================] - 150s - loss: 0.6512   
Epoch 12/50
2058743/2058743 [==============================] - 150s - loss: 0.6506   
Epoch 13/50
2058743/2058743 [==============================] - 150s - loss: 0.6500   
Epoch 14/50
2058743/2058743 [==============================] - 150s - loss: 0.6496   
Epoch 15/50
2058743/2058743 [==============================] - 150s - loss: 0.6492   
Epoch 16/50
2058743/2058743 [==============================] - 150s - loss: 0.6488   
Epoch 17/50
2058743/2058743 [==============================] - 151s - loss: 0.6485   
Epoch 18/50
2058743/2058743 [==============================] - 150s - loss: 0.6482   
Epoch 19/50
2058743/2058743 [==============================] - 150s - loss: 0.6480   
Epoch 20/50
2058743/2058743 [==============================] - 150s - loss: 0.6477   
Epoch 21/50
2058743/2058743 [==============================] - 150s - loss: 0.6475   
Epoch 22/50
2058743/2058743 [==============================] - 150s - loss: 0.6473   
Epoch 23/50
2058743/2058743 [==============================] - 150s - loss: 0.6471   
Epoch 24/50
2058743/2058743 [==============================] - 150s - loss: 0.6470   
Epoch 25/50
2058743/2058743 [==============================] - 150s - loss: 0.6468   
Epoch 26/50
2058743/2058743 [==============================] - 150s - loss: 0.6466   
Epoch 27/50
2058743/2058743 [==============================] - 150s - loss: 0.6464   
Epoch 28/50
2058743/2058743 [==============================] - 150s - loss: 0.6463   
Epoch 29/50
2058743/2058743 [==============================] - 150s - loss: 0.6462   
Epoch 30/50
2058743/2058743 [==============================] - 150s - loss: 0.6461   
Epoch 31/50
2058743/2058743 [==============================] - 150s - loss: 0.6460   
Epoch 32/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 33/50
2058743/2058743 [==============================] - 150s - loss: 0.6458   
Epoch 34/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 35/50
2058743/2058743 [==============================] - 150s - loss: 0.6456   
Epoch 36/50
2058743/2058743 [==============================] - 150s - loss: 0.6455   
Epoch 37/50
2058743/2058743 [==============================] - 150s - loss: 0.6454   
Epoch 38/50
2058743/2058743 [==============================] - 150s - loss: 0.6453   
Epoch 39/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 40/50
2058743/2058743 [==============================] - 150s - loss: 0.6452   
Epoch 41/50
2058743/2058743 [==============================] - 150s - loss: 0.6451   
Epoch 42/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 43/50
2058743/2058743 [==============================] - 150s - loss: 0.6450   
Epoch 44/50
2058743/2058743 [==============================] - 150s - loss: 0.6449   
Epoch 45/50
2058743/2058743 [==============================] - 150s - loss: 0.6448   
Epoch 46/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 47/50
2058743/2058743 [==============================] - 150s - loss: 0.6447   
Epoch 48/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 49/50
2058743/2058743 [==============================] - 150s - loss: 0.6446   
Epoch 50/50
2058743/2058743 [==============================] - 150s - loss: 0.6445   
Out[130]:
<keras.callbacks.History at 0x7f6ed8816a58>
 

Let's write some helper functions to test how well the trained LSTM performs:

In [131]:
def predict_next(input_array):
    x = np.reshape(input_array, (-1,seq_length,128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream)-seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    word = w2v_model.most_similar(positive=y, topn=1)
    return word
 

Now put it all together into one function:

In [137]:
def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string
In [138]:
init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
article = generate_article(init)
print(article)
 
language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say
