【487】Deep Learning for Text and Sequences
The units into which text is broken (words, characters, or n-grams) are called tokens, and the process of breaking text into tokens is called tokenization. Every text-vectorization process applies some tokenization scheme and then associates numeric vectors with the resulting tokens. There are two main approaches: one-hot encoding of tokens, and token embedding (typically used only for words, in which case it is called word embedding).
Bag-of-words is a tokenization approach that does not preserve order (the resulting tokens form a set rather than a sequence, discarding the overall structure of the sentences), so it tends to be used in shallow language-processing models rather than in deep-learning models.
1. One-hot encoding of words and characters
One-hot encoding is the most common and most basic way to turn tokens into vectors.
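For intuition, here is a minimal word-level sketch in plain Python (a toy version that simply splits on whitespace; it does not strip punctuation or cap the vocabulary):

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assign a unique integer index to every word seen in the samples
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1  # index 0 is left unused

# One-hot encode: each word becomes a vector of zeros with a single 1
# at the position given by its index
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.
```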
Keras has built-in utilities for word-level or character-level one-hot encoding of raw text data.
```python
import keras
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer, configured to only take into account
# the 1000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index from the contents of samples
tokenizer.fit_on_texts(samples)
# Turn the strings into lists of integer indices
# (words become numbers: [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]])
sequences = tokenizer.texts_to_sequences(samples)
# Get the one-hot binary representations directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# The index assigned to each word
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
```

output:

```
Found 9 unique tokens.

sequences:
[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

one_hot_results:
array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

word_index:
{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5,
 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
```
2. Using word embeddings
Another popular and powerful way to associate words with vectors is to use dense word vectors, also called word embeddings.
- One-hot word vectors: sparse (mostly zeros), high-dimensional (20,000 dimensions or more), hard-coded (carry no meaning)
- Word embeddings: dense (few zeros), low-dimensional (e.g. 256, 512, or 1024 dimensions), learned from data (carry some meaning)
There are two ways to obtain word embeddings:
- Learn the word embeddings jointly with the main task: start from random word vectors and learn them the same way you learn the weights of a neural network.
- Load precomputed word embeddings into the model: these are called pretrained word embeddings.
2.1 Learning word embeddings with the Embedding layer
The geometric relationships between word vectors should reflect the semantic relationships between the corresponding words; the point of word embeddings is to map human language into a geometric space. Keras can learn such an embedding space: all we do is learn the weights of one layer, the Embedding layer.
```python
from keras.layers import Embedding

# The Embedding layer takes at least two arguments:
# - the number of possible tokens (here 1000, i.e. maximum word index + 1)
# - the dimensionality of the embeddings (here 64, the word-vector size)
embedding_layer = Embedding(1000, 64)
```
It is best to think of the Embedding layer as a dictionary that maps integer indices (standing for specific words) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors: the Embedding layer is effectively a dictionary lookup.
word index → Embedding layer → corresponding word vector
- word index: the number assigned to a word
- Embedding layer: a weight matrix (one row per token)
- corresponding word vector: as in a dictionary, each word maps to exactly one vector
The Embedding layer takes as input a 2D integer tensor of shape (samples, sequence_length), where every entry is a sequence of integers. Shorter sequences should be padded with zeros, and longer sequences should be truncated.
The Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.
When you instantiate an Embedding layer, its weights (the internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation.
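As a quick check of these shapes, a small sketch (the input is just random integer indices, so the returned vectors are the layer's untrained random embeddings):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(1000, 64, input_length=10))

# A batch of 2 samples, each a padded sequence of 10 integer word indices
dummy_input = np.random.randint(0, 1000, size=(2, 10))

# Every integer index is replaced by its 64-dimensional vector
print(model.predict(dummy_input).shape)  # (2, 10, 64)
```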
Loading the IMDB data for use with an Embedding layer
```python
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cut the texts after this many words
# (among the max_features most common words)
maxlen = 20

# Load the data as lists of integers (each word is encoded as a number)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen):
# longer sequences are truncated, shorter ones are padded with zeros
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
```
After padding, each element (row) of x_train is an integer vector of length 20 (maxlen).
Using an Embedding layer and a classifier on the IMDB data
```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
# input_dim:    10000, the number of most common words considered
# output_dim:   8, the dimensionality of each word vector
# input_length: maxlen, the length of each (padded) text
# The Embedding layer's activations have shape (samples, maxlen, 8)
model.add(Embedding(10000, 8, input_length=maxlen))

# Flatten the 3D embedding tensor into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
```
The results are as follows:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 1s 69us/step - loss: 0.6759 - acc: 0.6049 - val_loss: 0.6398 - val_acc: 0.6814
Epoch 2/10
20000/20000 [==============================] - 1s 50us/step - loss: 0.5658 - acc: 0.7425 - val_loss: 0.5467 - val_acc: 0.7204
Epoch 3/10
20000/20000 [==============================] - 1s 53us/step - loss: 0.4752 - acc: 0.7808 - val_loss: 0.5113 - val_acc: 0.7384
Epoch 4/10
20000/20000 [==============================] - 1s 59us/step - loss: 0.4263 - acc: 0.8077 - val_loss: 0.5008 - val_acc: 0.7452
Epoch 5/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.3930 - acc: 0.8257 - val_loss: 0.4981 - val_acc: 0.7536
Epoch 6/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.3668 - acc: 0.8395 - val_loss: 0.5014 - val_acc: 0.7530
Epoch 7/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.3435 - acc: 0.8534 - val_loss: 0.5052 - val_acc: 0.7520
Epoch 8/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.3223 - acc: 0.8657 - val_loss: 0.5132 - val_acc: 0.7486
Epoch 9/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.3023 - acc: 0.8766 - val_loss: 0.5213 - val_acc: 0.7490
Epoch 10/10
20000/20000 [==============================] - 1s 51us/step - loss: 0.2839 - acc: 0.8860 - val_loss: 0.5303 - val_acc: 0.7466
```
The validation accuracy reaches about 75%, which is reasonable given that we only look at the first 20 words of each review and merely flatten the embedded sequences before training a single Dense layer on top. A better approach is to add recurrent layers or 1D convolution layers on top of the embedded sequences, so that each sequence is learned as a whole.
2.2 Using pretrained word embeddings
- word2vec
- GloVe (Global Vectors for Word Representation)
Both can be loaded directly into a Keras Embedding layer, as sketched below.
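A minimal sketch of the usual GloVe recipe. Here embeddings_index (a dict mapping words to 100-dimensional GloVe vectors parsed from glove.6B.100d.txt) and word_index (from a fitted Tokenizer) are assumed to have been prepared beforehand; the variable names are illustrative.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_words, embedding_dim, maxlen = 10000, 100, 20

# Build a (max_words, embedding_dim) matrix: row i holds the GloVe vector
# of the word whose index is i (words without a vector stay all-zero)
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# Load the pretrained vectors into the Embedding layer and freeze it,
# so training does not destroy what the vectors already encode
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
```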
3. Understanding recurrent neural networks
A recurrent neural network (RNN) processes a sequence by iterating over its elements while maintaining a state that contains information about what it has seen so far. The SimpleRNN layer in Keras implements the simplest form of RNN.
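To make the idea of a state concrete, here is a toy forward pass for a single sequence in plain NumPy (a sketch, with random matrices standing in for learned weights):

```python
import numpy as np

timesteps = 100       # number of timesteps in the input sequence
input_features = 32   # dimensionality of the input at each timestep
output_features = 64  # dimensionality of the output (and of the state)

inputs = np.random.random((timesteps, input_features))  # one input sequence
state_t = np.zeros((output_features,))                  # initial state: all zeros

# Random matrices standing in for learned weights
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:  # iterate over the timesteps
    # Combine the current input with the previous state
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t  # this output becomes the state for the next step

# Shape (timesteps, output_features): the full sequence of outputs;
# successive_outputs[-1] alone would be the "last output only" view
final_output_sequence = np.stack(successive_outputs, axis=0)
```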
Like all recurrent layers in Keras, SimpleRNN can be run in two different modes: it can return either the full sequence of successive outputs for every timestep (a 3D tensor of shape (batch_size, timesteps, output_features)) or only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). These two modes are controlled by the return_sequences constructor argument. Let's look at an example that uses SimpleRNN and returns only the output at the last timestep.
```python
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()
```

output:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                2080      
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
_________________________________________________________________
```
The following example returns the full sequence of successive outputs (states).
```python
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()
```

output:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080      
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
_________________________________________________________________
```
To increase the representational power of the network, it is sometimes useful to stack several recurrent layers one after the other. In that case, all intermediate layers must return full sequences of outputs.
```python
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
model.summary()
```

output:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_5 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 32)                2080      
=================================================================
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
_________________________________________________________________
```
Next, let's apply this kind of model to the IMDB movie-review classification problem. First, preprocess the data.
Preparing the IMDB data
```python
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
```
Training a model with an Embedding layer and a SimpleRNN layer
```python
from keras.layers import Dense

model = Sequential()
# Vocabulary of max_features words, 32-dimensional word vectors
model.add(Embedding(max_features, 32))
# SimpleRNN with 32 output units
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
```
4. Understanding the LSTM and GRU layers
Because of the vanishing gradient problem, SimpleRNN is unable to learn long-term dependencies.
LSTM (long short-term memory) layers have three gates.
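For reference, one standard formulation of the three gates (a sketch; notation differs slightly between references). The forget gate decides what to discard from the carry state, the input gate decides what new information to write into it, and the output gate decides how much of the carry to expose as the output:

$$
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
$$

$$
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t)
$$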
For more detail, refer to other posts about LSTM.
Using the LSTM layer in Keras
```python
from keras.layers import LSTM, Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
```
The validation accuracy reaches about 89%, much better than the SimpleRNN network, mainly because LSTM suffers far less from the vanishing gradient problem.
5. Advanced use of recurrent neural networks
- Recurrent dropout: a built-in, RNN-specific way of using dropout inside recurrent layers to fight overfitting.
- Stacking recurrent layers: increases the representational power of the network (at the cost of a higher computational load).
- Bidirectional recurrent layers: present the same information to the recurrent network in different ways, which can increase accuracy and mitigate forgetting.
GRU (gated recurrent unit) layers work on the same principle as LSTM, but they are somewhat simplified and therefore cheaper to run (although they may have less representational power). This trade-off between computational cost and representational power is found everywhere in machine learning.
5.1 Recurrent dropout
Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate of the layer's input units, and recurrent_dropout, which specifies the dropout rate of the recurrent units.
```python
# ...
from keras.models import Sequential
from keras import layers

# float_data is assumed to be the (timesteps, features) array of the
# time-series dataset being modeled (its preparation is not shown in this post)
model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.2,            # dropout rate for the layer's inputs
                     recurrent_dropout=0.2,  # dropout rate for the recurrent units
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
# ...
```
5.2 Stacking recurrent layers
Stacking several recurrent layers on top of each other builds more powerful recurrent networks. When stacking recurrent layers in Keras, every intermediate layer must return a full sequence of outputs (a 3D tensor) rather than only its output at the last timestep; this is done by specifying return_sequences=True.
```python
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

# float_data is again assumed to be the (timesteps, features) array of the dataset
model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.1,
                     recurrent_dropout=0.5,
                     return_sequences=True,  # the intermediate layer returns full sequences
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64, activation='relu',
                     dropout=0.1,
                     recurrent_dropout=0.5))
model.add(layers.Dense(1))
```
5.3 Bidirectional recurrent layers
Imagine living your life in reverse, growing younger instead of older: you would end up with a quite different mental model of the world. That is the idea behind a bidirectional RNN: process the same sequence both chronologically and in reverse.
To instantiate a bidirectional RNN in Keras, use the Bidirectional layer, which takes a recurrent layer instance as its first argument. Bidirectional creates a second, separate instance of that recurrent layer, uses one instance to process the input sequence in chronological order, and uses the other to process it in reverse order.
Bidirectional therefore contains two LSTM layers, one per direction; comparing the parameter counts below, the bidirectional LSTM has twice as many recurrent parameters as the plain LSTM (16,640 vs 8,320).
```python
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
```
output:
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                16640     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
=================================================================
Total params: 336,705
Trainable params: 336,705
Non-trainable params: 0
_________________________________________________________________
```
LSTM:
```python
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.LSTM(32))
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
```
output:
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
lstm_4 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 33        
=================================================================
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
```