NLP (21) Automatic Text Generation with an LSTM from Existing Text

Original post: http://www.one2know.cn/nlp21/

Automatic text generation with an LSTM from existing text

from __future__ import print_function
import numpy as np
import random
import sys

path = r'shakespeare_final.txt'
text = open(path).read().lower() # read the file into one string and lowercase it
characters = sorted(list(set(text))) # unique characters; this is the vocabulary used for encoding
print('corpus length:',len(text))
print('total chars:',len(characters))

char2indices = dict((c,i) for i,c in enumerate(characters)) # character => index
indices2char = dict((i,c) for i,c in enumerate(characters)) # index => character

maxlen = 40 # use a window of 40 characters to predict the next character
step = 3 # slide the window forward 3 characters at a time
sentences = []
next_chars = []
for i in range(0,len(text)-maxlen,step):
    sentences.append(text[i:i+maxlen])
    next_chars.append(text[i+maxlen])
print('nb sentences:',len(sentences)) # number of 40-character windows, i.e. the size of the training set

## Build the dataset as one-hot encodings
X = np.zeros((len(sentences),maxlen,len(characters)),dtype=bool)
y = np.zeros((len(sentences),len(characters)),dtype=bool)
for i,sentence in enumerate(sentences):
    for t,char in enumerate(sentence):
        X[i,t,char2indices[char]] = 1
    y[i,char2indices[next_chars[i]]] = 1
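# Shape check: X is (num_windows, maxlen, vocab_size) and y is (num_windows, vocab_size),
# i.e. roughly (193798, 40, 61) and (193798, 61) for this corpus; each time step holds
# a one-hot vector for a single character.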

# Build the neural network
from keras.models import Sequential
from keras.layers import Dense,LSTM,Activation,Dropout
from keras.optimizers import RMSprop
model = Sequential()
model.add(LSTM(128,input_shape=(maxlen,len(characters))))
model.add(Dense(len(characters)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',optimizer=RMSprop(lr=0.01))
print(model.summary())
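# Parameter-count check: the LSTM layer has 4 * ((61 + 128) * 128 + 128) = 97,280 weights
# (one set each for the input, forget, cell and output gates) and the Dense layer has
# 128 * 61 + 61 = 7,869, matching the Param # column printed by model.summary().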

def pred_indices(preds,metric=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / metric
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probs = np.random.multinomial(1,preds,1)
    return np.argmax(probs)
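# pred_indices performs temperature sampling: dividing the log-probabilities by `metric`
# (the "diversity" value used below) and re-normalizing sharpens the distribution when
# metric < 1 (near-greedy picks) and flattens it when metric > 1 (more varied picks);
# np.random.multinomial then draws one character index from the reweighted distribution.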

for iteration in range(1,30): # train one epoch per iteration so the generated text can be inspected after each round
    print('-' * 40)
    print('Iteration',iteration)
    model.fit(X,y,batch_size=128,epochs=1)
    start_index = random.randint(0,len(text)-maxlen-1)
    for diversity in [0.2,0.7,1.2]:
        print('\n----- diversity:',diversity)
        generated = ''
        sentence = text[start_index:start_index+maxlen]
        generated += sentence
        print('----- Generating with seed: "'+sentence+'"')
        sys.stdout.write(generated)
        for i in range(400):
            x = np.zeros((1,maxlen,len(characters)))
            for t,char in enumerate(sentence): # one-hot encode the current seed window
                x[0,t,char2indices[char]] = 1
            preds = model.predict(x,verbose=0)[0]
            next_index = pred_indices(preds,diversity)
            pred_char = indices2char[next_index]
            generated += pred_char
            sentence = sentence[1:] + pred_char
            sys.stdout.write(pred_char)
            sys.stdout.flush()
        print('\nOne combination completed \n')
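
To see what the diversity values do in isolation, here is a minimal standalone sketch (not part of the original script; the three-way toy distribution is hypothetical) that applies the same log / temperature / re-normalize reweighting used in pred_indices:

import numpy as np

def reweight(preds, temperature):
    # same reweighting as pred_indices, without the multinomial sampling step
    preds = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

toy = [0.6, 0.3, 0.1]          # hypothetical softmax output over three characters
print(reweight(toy, 0.2))      # ~[0.97, 0.03, 0.00]: low diversity, near-greedy choices
print(reweight(toy, 1.2))      # ~[0.56, 0.31, 0.13]: high diversity, flatter distribution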

Output:

corpus length: 581432
total chars: 61
nb sentences: 193798
Using TensorFlow backend.
WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               97280     
_________________________________________________________________
dense_1 (Dense)              (None, 61)                7869      
_________________________________________________________________
activation_1 (Activation)    (None, 61)                0         
=================================================================
Total params: 105,149
Trainable params: 105,149
Non-trainable params: 0
_________________________________________________________________
None
----------------------------------------
Iteration 1
WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/1
2019-07-15 17:04:03.721908: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-07-15 17:04:04.438003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.64GiB
2019-07-15 17:04:04.438676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-15 17:04:07.352274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-15 17:04:07.352543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-07-15 17:04:07.352701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-07-15 17:04:07.357455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0, compute capability: 5.0)
2019-07-15 17:04:08.415227: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally

   128/193798 [..............................] - ETA: 2:16:56 - loss: 4.1095
   256/193798 [..............................] - ETA: 1:09:23 - loss: 3.6938
   384/193798 [..............................] - ETA: 46:52 - loss: 3.8312 
   ...