"Deep Learning with Python" notes --- 8.1 Generating text with LSTM
I. Summary
One-sentence summary:
The principle is actually very simple: a single-layer LSTM learns the statistical regularities of the words and characters in the training data, and the softmax layer then acts as a classifier, outputting a probability for each character in the vocabulary.
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
1. What is the purpose of artificial intelligence?
【AI is not meant to replace our own intelligence】: Indeed, the AI-generated art we have seen so far is still of fairly low quality. AI is nowhere near matching human screenwriters, painters, and composers. But replacing humans was never the point: AI will not replace our own intelligence,
【but will bring more intelligence to our lives and work】: rather, it will bring more intelligence to our lives and work, that is, intelligence of a different kind. In many fields, especially creative ones, humans will use AI as a tool to augment their own abilities, achieving something more powerful than AI on its own.
2. Where does AI play a role?
【Simple pattern recognition and technical skill】: A large part of artistic creation consists of simple pattern recognition and technical skill. This is precisely the part of the process that many consider unappealing or even dispensable.
【Our perceptual modalities, our language, and our artworks all have statistical structure】: Learning this structure is what deep learning algorithms excel at.
3. A machine learning model is just a mathematical operation?
【Machine learning models can learn the statistical latent space of images, music, and stories, and then sample from that space】: creating new works with characteristics similar to the artworks the model saw in its training data.
【A machine learning model is just a mathematical operation】: Of course, such sampling is not in itself an act of artistic creation. It is merely a mathematical operation: the algorithm has no grounding in human life, human emotion, or our lived experience; instead, it learns from an experience very different from ours.
4. In the LSTM text-generation example, how is sequence data generated?
【Use the previous tokens as input and train a network to predict the next one or more tokens in the sequence】: The general way to generate sequence data with deep learning is to train a network (usually a recurrent or convolutional neural network) to predict the next token or tokens in a sequence, using the previous tokens as input.
【For example, given the input "the cat is on the ma", train the network to predict the target "t", i.e. the next character.】
5. What is a language model?
【Any network that can model the probability of the next token given the previous tokens】: As before when handling text data, a token is typically a word or a character. Any network that can model the probability of the next token given the previous tokens is called a language model.
【The latent space of language, i.e. its statistical structure】: A language model captures the latent space of language, that is, its statistical structure.
6. In the LSTM text-generation example, what are sampling and conditioning data?
【Sampling (i.e. generating new sequences)】: Once such a language model has been trained, you can sample from it, i.e. generate new sequences.
【Initial text string, i.e. conditioning data】: Feed the model an initial text string (the conditioning data), ask it to generate the next character or the next word (you can even generate several tokens at once), then append the generated output to the input data, and repeat this process many times.
7. When generating text, why does the choice of the next character matter so much?
【Greedy sampling】: A simple approach is greedy sampling: always choose the most likely next character. But this yields repetitive, predictable strings that don't look like coherent language.
【Stochastic sampling】: A more interesting approach is to make slightly surprising choices: introduce randomness into the sampling process by sampling from the probability distribution over the next character. This is called stochastic sampling (stochasticity simply means "randomness" in this field). With this approach, if the model says the next character is "e" with probability 0.3, you will pick it 30% of the time.
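A minimal sketch of the difference between the two strategies, assuming a made-up next-character distribution preds (the values and the toy four-character alphabet are purely illustrative):
import numpy as np

# Hypothetical probability distribution over a toy alphabet of 4 characters
preds = np.array([0.1, 0.3, 0.4, 0.2])

# Greedy sampling: always take the argmax (index 2 here, every single time)
greedy_index = np.argmax(preds)

# Stochastic sampling: draw from the distribution itself (index 2 only ~40% of the time)
stochastic_index = np.random.choice(len(preds), p=preds)

print(greedy_index, stochastic_index)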
8. Why does sampling (generating new sequences) need some randomness?
【Pure random sampling has maximum entropy, i.e. maximum randomness】: Consider one extreme: pure random sampling, where the next character is drawn from a uniform probability distribution in which every character is equally likely. This scheme has maximum randomness; in other words, the probability distribution has maximum entropy. Of course, it won't generate anything interesting.
【Greedy sampling has minimum entropy, i.e. no randomness at all】: Now consider the other extreme, greedy sampling. It doesn't generate anything interesting either: it has no randomness whatsoever, and the corresponding probability distribution has minimum entropy.
【Less entropy gives generated sequences a more predictable structure (so they may look more realistic), while more entropy yields more surprising and creative sequences】: But there are many intermediate points with more or less entropy that are worth exploring. Less entropy makes the generated sequences more predictable in structure (and thus potentially more realistic-looking), whereas more entropy produces more surprising and creative sequences.
9. What is softmax temperature?
【To control the amount of randomness during sampling】: we introduce a parameter called the softmax temperature,
【which characterizes the entropy of the sampling probability distribution, i.e. how surprising or predictable the choice of the next character will be.】
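A minimal sketch of how temperature reweights a distribution; the example distribution is made up, but the reweighting (log, divide by temperature, exponentiate, renormalize) is the same as in the sample function used later in this post:
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Lower temperature -> peakier distribution (lower entropy);
    # higher temperature -> flatter distribution (higher entropy)
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

original = np.array([0.1, 0.3, 0.4, 0.2])
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, np.round(reweight_distribution(original, t), 3))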
10. The single-layer LSTM model used to predict the next character?
The principle is very simple: a single-layer LSTM learns the statistical regularities of the words and characters in the training data, and the softmax layer then acts as a classifier, outputting a probability for each character in the vocabulary.
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
11. Key points about generating text with an LSTM?
We can generate discrete sequence data by training a model to predict the next one or more tokens given the previous tokens.
For text, such a model is called a language model. It can be word-level or character-level.
Sampling the next token requires balancing the model's judgment against the randomness you introduce.
One way to handle this is the softmax temperature. Always try several different temperatures to find the right one.
II. 8.1 Generating text with LSTM
Video location of the course corresponding to this post:
[...]
Implementing character-level LSTM text generation
Let's put these ideas in practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files -- Wikipedia, the Lord of the Rings, etc. In this example we will use some of the writings of Nietzsche, the late-19th century German philosopher (translated to English). The language model we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the English language.
Preparing the data
Let's start by downloading the corpus and converting it to lowercase:
from tensorflow import keras
import numpy as np
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))
print(text[0:400])
Next, we will extract partially overlapping sequences of length maxlen, one-hot encode them, and pack them in a 3D Numpy array x of shape (sequences, maxlen, unique_characters). Simultaneously, we prepare an array y containing the corresponding targets: the one-hot encoded characters that come right after each extracted sequence.
# Length of extracted character sequences
maxlen = 60
# We sample a new sequence every `step` characters
step = 3
# This holds our extracted sequences
sentences = []
# This holds the targets (the follow-up characters)
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)
# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print(chars)
Building the network
Our network is a single LSTM layer followed by a Dense classifier and softmax over all possible characters. But note that recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in recent times.
from tensorflow.keras import layers
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
Since our targets are one-hot encoded, we will use categorical_crossentropy as the loss to train the model:
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
Training the language model and sampling from it
Given a trained model and a seed text snippet, we generate new text by repeatedly:
- 1) Drawing from the model a probability distribution over the next character given the text available so far
- 2) Reweighting the distribution to a certain "temperature"
- 3) Sampling the next character at random according to the reweighted distribution
- 4) Adding the new character at the end of the available text
This is the code we use to reweight the original probability distribution coming out of the model, and draw a character index from it (the "sampling function"):
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
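As a quick sanity check (not part of the original listing), here is a hypothetical call to sample on a made-up distribution, showing how lower temperatures concentrate the draws on the most likely index while higher temperatures spread them out:
toy_preds = [0.1, 0.3, 0.4, 0.2]   # made-up next-character probabilities
for t in [0.2, 0.5, 1.0, 1.2]:
    draws = [sample(toy_preds, temperature=t) for _ in range(1000)]
    print(t, np.bincount(draws, minlength=len(toy_preds)) / 1000)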
Finally, this is the loop where we repeatedly train the model and generate text. After every epoch we generate text using a range of different temperatures. This lets us see how the generated text evolves as the model starts converging, as well as the impact of temperature on the sampling strategy.
import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as "eterned" or "troveration"). With a high temperature, the local structure starts breaking down and most words look like semi-random strings of characters. Without a doubt, here 0.5 is the most interesting temperature for text generation in this specific setup. Always experiment with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.
Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic statistical structure, thus making it impossible to learn a language model like we just did.
Takeaways
- We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
- In the case of text, such a model is called a "language model" and could be based on either words or characters.
- Sampling the next token requires balance between adhering to what the model judges likely, and introducing randomness.
- One way to handle this is the notion of softmax temperature. Always experiment with different temperatures to find the "right" one.