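Text Generation with an LSTM Model
This section trains an LSTM model on the collected works of Shakespeare and then uses the trained model to generate text.
The dataset can be downloaded from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt. After downloading, save it in the current directory as "shakespeare.txt".
Create a new Python 3 file named text_generation_lstm.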
I. Import the required modules and check their versions
import matplotlib as mpl
import matplotlib.pyplot as plt
# %matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import tensorflow as tf
from tensorflow import keras

print(tf.__version__)
print(sys.version_info)
for module in mpl, np, pd, sklearn, tf, keras:
    print(module.__name__, module.__version__)
The output is as follows:
2.0.0
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
matplotlib 3.1.1
numpy 1.16.5
pandas 0.25.1
sklearn 0.21.3
tensorflow 2.0.0
tensorflow_core.keras 2.2.4-tf
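II. Read the dataset
input_filepath = './shakespeare.txt'
# Read the whole corpus into a single string
with open(input_filepath, 'r', encoding='utf-8') as f:
    text = f.read()
print(len(text))
print(text[0:10])
Running the code prints the length of the data and its first ten characters: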
1115394
First Citi
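III. Data processing
There are three things to do here:
1. Build the vocabulary
2. Build the lookup tables: character -> id and id -> character
3. Define the inputs and targets, e.g. input abcd -> target bcde
1. Build the vocabulary
# set() removes duplicate characters; sorted() puts them in a fixed order
vocab = sorted(set(text))
print(len(vocab))
print(vocab)
The output is as follows: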
65 ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd',
'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The dataset contains 65 distinct characters.
2. Build the lookup tables
# Lookup table: character -> id
char2idx = {char: idx for idx, char in enumerate(vocab)}
print(char2idx)
The output is as follows:
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12,
'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24,
'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36,
'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48,
'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60,
'w': 61, 'x': 62, 'y': 63, 'z': 64}
# Convert vocab to a NumPy array, which serves as the id -> character lookup
idx2char = np.array(vocab)
print(idx2char)
The output is as follows:
['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
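Next, we use the lookup table to convert text into the corresponding sequence of ids:
# Convert text into an array of character ids
text_as_int = np.array([char2idx[c] for c in text])
print(text_as_int[0:10])
print(text[0:10])
Running the code prints the first ten elements of text_as_int together with the characters they encode: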
[18 47 56 57 58 1 15 47 58 47]
First Citi
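The two lookup tables are inverses of each other, so encoding a string and decoding the result should return the original string. The following is a minimal sketch of that round trip in plain Python. The names toy_text, encode and decode are hypothetical helpers introduced only for this illustration, and a short toy string stands in for the full corpus.

```python
# Toy text standing in for the Shakespeare corpus (an assumption for illustration)
toy_text = "First Citizen"

# Same construction as in the tutorial, applied to the toy text
toy_vocab = sorted(set(toy_text))
toy_char2idx = {char: idx for idx, char in enumerate(toy_vocab)}
toy_idx2char = list(toy_vocab)  # position i holds the character with id i


def encode(s):
    """Map each character of s to its integer id."""
    return [toy_char2idx[c] for c in s]


def decode(ids):
    """Map a sequence of ids back to a string."""
    return "".join(toy_idx2char[i] for i in ids)


ids = encode(toy_text)
print(ids)
print(decode(ids))  # prints the original toy text
```

Because the vocabulary is sorted before ids are assigned, the mapping is deterministic: the same text always yields the same ids.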
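3. Define the inputs and targets
Next, we define a function that splits a sequence into its input and its target:
# Split a sequence into input and target
def split_input_target(id_text):
    """
    For the text abcde the input is abcd and the targets are bcde,
    i.e. each target is the character that follows the matching input character.
    """
    return id_text[0:-1], id_text[1:]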
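To make the shift by one position concrete, here is a small sketch that applies the same slicing to the string "abcde" instead of an array of ids. Slicing behaves the same way on strings, lists and tensors, so the toy string is only a stand-in.

```python
def split_input_target(id_text):
    # Input drops the last element; target drops the first
    return id_text[0:-1], id_text[1:]


inputs, targets = split_input_target("abcde")
for x, y in zip(inputs, targets):
    print(f"{x} -> {y}")
# a -> b
# b -> c
# c -> d
# d -> e
```

Both halves have the same length, which is one less than the length of the original sequence.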
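IV. Define the dataset
We convert the data into a tf.data.Dataset, which makes preprocessing and training more convenient.
1. Define the dataset
# Dataset of individual character ids
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# Dataset of fixed-length sequences ("sentences")
seq_length = 100
seq_dataset = char_dataset.batch(seq_length + 1, drop_remainder=True)
# Print the first two elements of each dataset
for ch_id in char_dataset.take(2):
    print(ch_id, idx2char[ch_id.numpy()])
for seq_id in seq_dataset.take(2):
    print(seq_id)
    print(repr(''.join(idx2char[seq_id.numpy()])))
Here seq_dataset groups the character dataset char_dataset into sequences of length seq_length+1. The extra character is needed because the input/target split defined earlier drops one element from each end, which leaves both the input and the target with length seq_length.
The output is as follows: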
tf.Tensor(18, shape=(), dtype=int32) F
tf.Tensor(47, shape=(), dtype=int32) i
tf.Tensor(
[18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 14 43 44 53 56 43 1 61 43 1 54 56 53 41 43 43 42 1 39 52 63 1 44 59 56 58 46 43 56 6 1 46 43 39 56 1 51 43 1 57 54 43 39 49 8 0 0 13 50 50 10 0 31 54 43 39 49 6 1 57 54 43 39 49 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 37 53 59 1], shape=(101,), dtype=int32)
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
tf.Tensor(
[39 56 43 1 39 50 50 1 56 43 57 53 50 60 43 42 1 56 39 58 46 43 56 1 58 53 1 42 47 43 1 58 46 39 52 1 58 53 1 44 39 51 47 57 46 12 0 0 13 50 50 10 0 30 43 57 53 50 60 43 42 8 1 56 43 57 53 50 60 43 42 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 18 47 56 57 58 6 1 63 53 59 1 49], shape=(101,), dtype=int32)
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
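2. Build the inputs and targets
We apply the split_input_target function defined earlier to every sequence in the dataset.
seq_dataset = seq_dataset.map(split_input_target)
for item_input, item_output in seq_dataset.take(2):
    print(item_input.numpy())
    print(item_output.numpy())
Taking the first two sequences as an example, the inputs and targets look like this: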
[18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 14 43 44 53 56 43 1 61 43 1 54 56 53 41 43 43 42 1 39 52 63 1 44 59 56 58 46 43 56 6 1 46 43 39 56 1 51 43 1 57 54 43 39 49 8 0 0 13 50 50 10 0 31 54 43 39 49 6 1 57 54 43 39 49 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 37 53 59]
[47 56 57 58 1 15 47 58 47 64 43 52 10 0 14 43 44 53 56 43 1 61 43 1 54 56 53 41 43 43 42 1 39 52 63 1 44 59 56 58 46 43 56 6 1 46 43 39 56 1 51 43 1 57 54 43 39 49 8 0 0 13 50 50 10 0 31 54 43 39 49 6 1 57 54 43 39 49 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 37 53 59 1]
[39 56 43 1 39 50 50 1 56 43 57 53 50 60 43 42 1 56 39 58 46 43 56 1 58 53 1 42 47 43 1 58 46 39 52 1 58 53 1 44 39 51 47 57 46 12 0 0 13 50 50 10 0 30 43 57 53 50 60 43 42 8 1 56 43 57 53 50 60 43 42 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 18 47 56 57 58 6 1 63 53 59 1]
[56 43 1 39 50 50 1 56 43 57 53 50 60 43 42 1 56 39 58 46 43 56 1 58 53 1 42 47 43 1 58 46 39 52 1 58 53 1 44 39 51 47 57 46 12 0 0 13 50 50 10 0 30 43 57 53 50 60 43 42 8 1 56 43 57 53 50 60 43 42 8 0 0 18 47 56 57 58 1 15 47 58 47 64 43 52 10 0 18 47 56 57 58 6 1 63 53 59 1 49]
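3. Shuffling and batching
batch_size = 64
buffer_size = 10000
# Shuffle the sequences, then group them into batches of 64, dropping the incomplete last batch
seq_dataset = seq_dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)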
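Because both batching steps use drop_remainder=True, the number of sequences and the number of batches per epoch follow from integer division. The arithmetic below is a sketch using only the numbers printed earlier in this section (text length 1115394, sequence length 100, batch size 64). The result agrees with the 172/172 steps shown in the training log later on.

```python
text_length = 1115394  # len(text) printed when the dataset was read
seq_length = 100
batch_size = 64

# Each sequence consumes seq_length + 1 characters; the incomplete tail is dropped
num_sequences = text_length // (seq_length + 1)
dropped_chars = text_length % (seq_length + 1)

# Sequences are grouped into batches; the incomplete last batch is dropped
steps_per_epoch = num_sequences // batch_size

print(num_sequences, dropped_chars, steps_per_epoch)  # 11043 51 172
```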
V. Define the model
# Define the model
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = keras.models.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim,
                               batch_input_shape=[batch_size, None]),
        keras.layers.LSTM(units=rnn_units,
                          stateful=True,
                          recurrent_initializer="glorot_uniform",
                          return_sequences=True),
        keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(vocab_size=vocab_size,
                    embedding_dim=embedding_dim,
                    rnn_units=rnn_units,
                    batch_size=batch_size)
model.summary()
The Embedding layer encodes each character as a vector of length 256. The sequence length is left unspecified, so the model's input shape is [batch_size, None], where None stands for the sequence length.
In the LSTM layer, units=rnn_units means that each character is represented by a vector of length rnn_units after passing through the LSTM. return_sequences=True makes the layer output the whole sequence, one vector per time step, rather than only the last step.
The output is as follows:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (64, None, 256)           16640
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625
=================================================================
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________
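The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard layout of each layer: an embedding table with one row per character, an LSTM with four gates that each have an input kernel, a recurrent kernel and a bias, and a dense layer with a bias.

```python
vocab_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weight matrix from rnn_units to vocab_size, plus a bias per output
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 16640 5246976 66625 5330241
```

Almost all of the parameters sit in the LSTM layer, which is also why it dominates the training time.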
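Before training, we first run the untrained model to see what its output looks like.
The code is as follows:
# Inspect the model output for one batch
for input_example_batch, target_example_batch in seq_dataset.take(1):
    example_batch_prediction = model(input_example_batch)
    print(example_batch_prediction.shape)
The output is as follows: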
(64, 100, 65)
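The output has shape (64, 100, 65): 64 is the batch size, 100 is the sequence length, and 65 is the number of scores (logits) per position, one for each character in the vocabulary.
Next, we sample randomly from the distribution that these logits define over the vocabulary. The code is as follows:
# Random sampling
sample_indices = tf.random.categorical(
    logits=example_batch_prediction[0], num_samples=1)
print(sample_indices)
sample_indices = tf.squeeze(sample_indices, axis=-1)
print(sample_indices)
tf.random.categorical performs the random sampling. We pass in the first example of example_batch_prediction; num_samples=1 means that one character id is drawn per position from the distribution over the 65 characters.
Since the input has shape (100, 65), the sampled result has shape (100, 1). tf.squeeze then removes the last dimension, so the final result has shape (100,).
The result is: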
tf.Tensor(
[[59] [11] [34] [10] [35] [10] [ 0] [64] [39] [53] [52] [62] [22] [35] [37] [ 9] [58] [45] [12] [21] [24] [63] [20] [ 5] [ 7] [54] [34] [43] [35] [41] [41] [15] [33] [ 9] [12] [33] [42] [55] [30] [45] [52] [25] [ 6] [53] [55] [ 2] [48] [ 6] [47] [ 3] [17] [26] [18] [56] [59] [53] [ 2] [30] [14] [ 2] [18] [33] [53] [36] [41] [64] [16] [ 5] [50] [63] [31] [19] [27] [27] [ 9] [59] [62] [23] [41] [35] [56] [40] [30] [18] [36] [62] [54] [26] [37] [ 6] [47] [52] [57] [17] [52] [35] [62] [63] [23] [58]], shape=(100, 1), dtype=int64)
tf.Tensor(
[59 11 34 10 35 10 0 64 39 53 52 62 22 35 37 9 58 45 12 21 24 63 20 5 7 54 34 43 35 41 41 15 33 9 12 33 42 55 30 45 52 25 6 53 55 2 48 6 47 3 17 26 18 56 59 53 2 30 14 2 18 33 53 36 41 64 16 5 50 63 31 19 27 27 9 59 62 23 41 35 56 40 30 18 36 62 54 26 37 6 47 52 57 17 52 35 62 63 23 58], shape=(100,), dtype=int64)
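What the sampling step does can be sketched without TensorFlow: turn each row of logits into softmax weights and draw one index in proportion to those weights. This is only an illustration of the idea using the standard library, not how tf.random.categorical is implemented internally. The helper name sample_from_logits and the random fake_logits are assumptions made for the example.

```python
import math
import random


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - m) for v in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
vocab_size = 65
seq_len = 100

# Stand-in for example_batch_prediction[0]: one row of 65 logits per position
fake_logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
               for _ in range(seq_len)]

sampled = [sample_from_logits(row, rng) for row in fake_logits]
print(len(sampled))  # 100, one sampled id per position
```

Sampling, rather than always taking the highest-scoring character, is what keeps generated text from getting stuck in a loop of the most likely characters.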
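Finally, we print the characters that these ids correspond to. The code is as follows:
# Print the strings for the input, the target and the prediction
print("Input:", repr("".join(idx2char[input_example_batch[0].numpy()])))
print()
print("Output:", repr("".join(idx2char[target_example_batch[0].numpy()])))
print()
print("Predictions:", repr("".join(idx2char[sample_indices.numpy()])))
The output is as follows: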
Input: "ed Richard's royal queen.\n\nQUEEN ELIZABETH:\nO, cut my lace in sunder, that my pent heart\nMay have so"

Output: "d Richard's royal queen.\n\nQUEEN ELIZABETH:\nO, cut my lace in sunder, that my pent heart\nMay have som"

Predictions: "u;V:W:\nzaonxJWY3tg?ILyH'-pVeWccCU3?UdqRgnM,oq!j,i$ENFruo!RB!FUoXczD'lySGOO3uxKcWrbRFXxpNY,insEnWxyKt"
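The predictions are poor because the model has not been trained yet, but the pipeline already works end to end.
VI. Training
1. Custom loss function
We first define a custom loss function:
# Custom loss function
def loss(labels, logits):
    return keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
# Loss on the earlier example batch
example_loss = loss(target_example_batch, example_batch_prediction)
print(example_loss.shape)
print(example_loss.numpy().mean())
Running the code prints the shape and the mean value of the loss on the earlier example: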
(64, 100)
4.1747465
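The value 4.17 is what one would expect from an untrained model. If the model assigns roughly equal probability to each of the 65 characters, the cross-entropy per character is close to ln(65). The sketch below computes sparse categorical cross-entropy from logits by hand for perfectly uniform logits. The helper name sparse_ce_from_logits is hypothetical and the label 18 is an arbitrary choice.

```python
import math


def sparse_ce_from_logits(label, logits):
    """Cross-entropy of one integer label against unnormalized logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


vocab_size = 65
uniform_logits = [0.0] * vocab_size  # every character equally likely

baseline = sparse_ce_from_logits(label=18, logits=uniform_logits)
print(round(baseline, 4))  # 4.1744, i.e. ln(65)
```

A training loss that falls well below this baseline indicates that the model has learned something beyond guessing uniformly.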
2. Saving the model
We define a checkpoint callback that saves the model weights during training.
# Checkpoint directory and callback
output_dir = "./text_generation_lstm_checkpoints"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
checkpoint_prefix = os.path.join(output_dir, 'ckpt_{epoch}')
checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
3. Training
epochs = 10
history = model.fit(seq_dataset, epochs=epochs,
                    callbacks=[checkpoint_callback])
The training progress is as follows:
Epoch 1/10
172/172 [==============================] - 775s 5s/step - loss: 2.5823
Epoch 2/10
172/172 [==============================] - 772s 4s/step - loss: 1.8873
Epoch 3/10
172/172 [==============================] - 1961s 11s/step - loss: 1.6376
Epoch 4/10
172/172 [==============================] - 777s 5s/step - loss: 1.5030
Epoch 5/10
172/172 [==============================] - 3035s 18s/step - loss: 1.4225
Epoch 6/10
172/172 [==============================] - 730s 4s/step - loss: 1.3668
Epoch 7/10
172/172 [==============================] - 662s 4s/step - loss: 1.3220
Epoch 8/10
172/172 [==============================] - 665s 4s/step - loss: 1.2838
Epoch 9/10
172/172 [==============================] - 677s 4s/step - loss: 1.2490
Epoch 10/10
172/172 [==============================] - 673s 4s/step - loss: 1.2137
After training, we look up the most recently saved checkpoint:
# Find the most recently saved checkpoint
tf.train.latest_checkpoint(output_dir)
The output is as follows:
'./text_generation_lstm_checkpoints\\ckpt_10'
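VII. Text generation
We first build a second model and load the most recently saved weights into it. It is built with batch_size=1 so that a single start string can be fed in at generation time.
model2 = build_model(vocab_size,
                     embedding_dim,
                     rnn_units,
                     batch_size=1)
model2.load_weights(tf.train.latest_checkpoint(output_dir))
# Set the input shape of model2
model2.build(tf.TensorShape([1, None]))
model2.summary()
The output is as follows: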
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (1, None, 256)            16640
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           5246976
_________________________________________________________________
dense_1 (Dense)              (1, None, 65)             66625
=================================================================
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________
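Finally, we define a function that generates text.
# Text generation
temperature = 0.5

def generate_text(model, start_string, num_generate=1000):
    # Convert the start string to ids and add a batch dimension
    input_eval = [char2idx[ch] for ch in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    model.reset_states()
    for _ in range(num_generate):
        predictions = model(input_eval)
        # Dividing the logits by a temperature below 1 sharpens the distribution
        predictions = predictions / temperature
        predictions = tf.squeeze(predictions, 0)
        # Sample one id from the distribution at the last time step
        predicted_id = tf.random.categorical(
            predictions, num_samples=1)[-1, 0].numpy()
        text_generated.append(idx2char[predicted_id])
        # Feed the sampled character back in as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
    return start_string + ''.join(text_generated)

# Call the function
new_text = generate_text(model2, "first: ")
print(new_text)
Running the code with the start string "first: " produces the following output: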
first: the sun

KING RICHARD II:
Have we no disiness of my soul to this shame, That hath a stand and stone of monster than the bears, And so and heard the sea of death.

LEONTES:
What shall we here?

BENVOLIO:
Alas, that news, what says my love!

CLARENCE:
My lord, this is not so dishonour'd him, That you shall have done, so we will be content: The shapes of love, that seal'd the state with heavens of heart, To think it strangers as you that have been To her and loss of his accusers, and stumbled, How is it here, and what a cruel waters, That are you shall seek to them and warriared.

TRANIO:
And therefore stand upon the world's accusation that still have for some state of mine.

HENRY BOLINGBROKE:
The arm of my soul, that I may leave you.

MERCUTIO:
Here's some that will not honour than her mother, I am so med my shoulder, and then at him, Had comfort in the violent of your souls, That comforts of eather doth their souls, With leave before her brothers, and am I come to the torture, That the
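The generated text is fairly well structured: it follows the layout of a play, with speaker names followed by lines of mostly real English words, even though the sentences themselves do not make sense.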
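The temperature value of 0.5 used in generate_text controls how adventurous the sampling is. Dividing the logits by a temperature below 1 sharpens the distribution, so likely characters are chosen even more often, while a temperature above 1 flattens it. The sketch below shows the effect on three made-up logits. The helper softmax_with_temperature and the values in toy_logits are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three characters

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top-scoring character is highest at temperature 0.5 and lowest at 2.0, so lower temperatures give more predictable text and higher temperatures give more varied but more error-prone text.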