Implementing 1pre1 with Embedding

1. Embedding -- an encoding method

Why use Embedding?

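One-hot encoding: the width of a one-hot code must equal the vocabulary size, so it wastes a lot of resources as the vocabulary grows. The data is large and overly sparse, and the codes are independent of one another, so they express no relationship between words.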

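Embedding: a word-encoding method that represents each word with a low-dimensional vector. The encoding is optimized through neural-network training and can express how closely words are related.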
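A minimal NumPy sketch of the contrast above. The 5-letter vocabulary matches the example used later in this post; the table values are random placeholders rather than trained weights, and the names vocab_size, embed_dim and table are chosen only for illustration.

```python
import numpy as np

vocab_size, embed_dim = 5, 2          # 5 letters, 2 numbers per letter (illustrative)
ids = np.array([0, 1, 2, 3, 4])       # a, b, c, d, e

# One-hot: one column per vocabulary entry, so the width grows with the vocabulary.
one_hot = np.eye(vocab_size)[ids]     # shape (5, 5), a single 1 per row

# Embedding: each id selects one row of a small dense table.
rng = np.random.default_rng(7)
table = rng.normal(size=(vocab_size, embed_dim))   # placeholder values, not trained
embedded = table[ids]                 # shape (5, 2)

# Looking up a row is the same as multiplying the one-hot vector by the table.
same = np.allclose(one_hot @ table, embedded)
print(one_hot.shape, embedded.shape, same)
```

With a 5-letter vocabulary the saving is small, but the one-hot width scales with the vocabulary while the embedding width stays at whatever dimension is chosen.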

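The word-vector encoding layer in TensorFlow 2 is tensorflow.keras.layers.Embedding; the script in section 2 calls it as Embedding(5, 2), i.e. vocabulary size 5 and encoding dimension 2.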

2. Implementing 1pre1 with Embedding

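Unlike the one-hot version, the input features must be reshaped into the shape the Embedding layer expects: the first dimension is the number of samples fed in, and the second is the number of time steps over which the recurrent kernel is unrolled.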
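The shape difference can be sketched with NumPy alone. The one-hot shape shown for comparison assumes that variant feeds a 5-wide vector per time step; x_ids, x_embed and x_onehot are illustrative names.

```python
import numpy as np

x_ids = [0, 1, 2, 3, 4]                       # letter ids for a, b, c, d, e

# Embedding input: (number of samples, time steps); each entry is an integer id.
x_embed = np.reshape(x_ids, (len(x_ids), 1))  # shape (5, 1)

# One-hot input to the recurrent layer, for comparison (assumed layout):
# (number of samples, time steps, features per step).
x_onehot = np.eye(5)[x_ids].reshape(len(x_ids), 1, 5)  # shape (5, 1, 5)

print(x_embed.shape, x_onehot.shape)
```

The Embedding layer itself adds the trailing feature dimension, which is why its input stops at two dimensions.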

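Compared with the one-hot version, the model gains an Embedding layer that encodes the input data. This layer creates a trainable parameter matrix with five rows and two columns, so the encoding itself is learned during training.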
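As a rough check on where the trainable parameters sit, the counts for Embedding(5, 2), SimpleRNN(3) and Dense(5) can be worked out by hand. This sketch assumes the Keras defaults (bias terms enabled in SimpleRNN and Dense, no bias in Embedding); the output of model.summary() in the script remains the authoritative figure.

```python
vocab_size, embed_dim = 5, 2   # Embedding(5, 2)
rnn_units = 3                  # SimpleRNN(3)
num_classes = 5                # Dense(5)

# Embedding: one row of embed_dim numbers per vocabulary entry (the 5 x 2 matrix).
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias.
rnn_params = embed_dim * rnn_units + rnn_units * rnn_units + rnn_units

# Dense: weights + bias.
dense_params = rnn_units * num_classes + num_classes

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)   # 10 18 20 48
```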

At prediction time, simply look up the ID of the input letter, reshape it into the shape the Embedding layer expects, and feed it to the network.
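The lookup, reshape and decode steps can be traced without a trained model. In the sketch below, fake_predict is a hypothetical stand-in for the network's forward pass and simply pretends the "next letter" rule has been learned; only the surrounding steps mirror the script.

```python
import numpy as np

input_word = 'abcde'
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

def fake_predict(x):
    """Hypothetical stand-in for the model: one row of 5 probabilities per sample."""
    next_ids = (x[:, 0] + 1) % 5          # pretend the model learned "next letter"
    probs = np.full((len(x), 5), 0.05)
    probs[np.arange(len(x)), next_ids] = 0.8
    return probs

letter = 'c'
x = np.reshape([w_to_id[letter]], (1, 1))   # (samples, time steps) = (1, 1)
result = fake_predict(x)                    # shape (1, 5)
pred = int(np.argmax(result, axis=1)[0])    # index of the largest probability
print(letter + '->' + input_word[pred])     # c->d
```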

predict(input data, batch_size=integer) returns the result of the forward pass.

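Note: the predict parameters in detail.

(1) x: the input data, a NumPy array (or a list of NumPy arrays if the model has multiple inputs);

(2) batch_size: an integer. Because of how GPUs work, powers of two such as 8, 16, 32, 64, ... are preferred; defaults to 32 if not specified;

(3) verbose: logging mode, 0 or 1;

(4) steps: the total number of steps (batches of samples) before the prediction round is declared finished; defaults to None;

(5) Returns: a NumPy array (or list of arrays) of predictions.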
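The effect of batch_size can be pictured with a small sketch: the inputs are split into batches, each batch goes through the forward computation, and the outputs are concatenated, so the batch size changes the number of steps but not the returned values. The names forward and predict_in_batches are hypothetical, and this is only an illustration of the idea, not how Keras implements predict internally.

```python
import math
import numpy as np

def forward(batch):
    """Hypothetical forward pass: maps each id to a 5-way one-hot 'prediction'."""
    return np.eye(5)[(batch[:, 0] + 1) % 5]

def predict_in_batches(x, batch_size=32):
    """Run forward() on slices of at most batch_size samples and stack the results."""
    outputs = [forward(x[i:i + batch_size]) for i in range(0, len(x), batch_size)]
    return np.concatenate(outputs, axis=0)

x = np.arange(5).reshape(5, 1)
steps = math.ceil(len(x) / 2)                   # 3 batches when batch_size=2
small = predict_in_batches(x, batch_size=2)
default = predict_in_batches(x)                 # one batch, since 5 <= 32
print(steps, small.shape, np.array_equal(small, default))
```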

#! /usr/bin/env python
# -*- coding:utf-8 -*-

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
import matplotlib.pyplot as plt
import os


input_word = 'abcde'
# Dictionary mapping each letter to a numeric id
w_to_id = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
x_train = [w_to_id['a'], w_to_id['b'], w_to_id['c'], w_to_id['d'], w_to_id['e']]
y_train = [w_to_id['b'], w_to_id['c'], w_to_id['d'], w_to_id['e'], w_to_id['a']]

# Shuffle x_train and y_train with the same seed so the input/label pairs stay aligned
np.random.seed(7)
np.random.shuffle(x_train)
np.random.seed(7)
np.random.shuffle(y_train)
tf.random.set_seed(7)

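# Reshape x_train to the input shape Embedding expects: [number of samples, time steps].
# The whole dataset is fed at once, so the number of samples is len(x_train);
# one letter in gives one result, so the number of time steps is 1.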
x_train = np.reshape(x_train, (len(x_train), 1))
y_train = np.array(y_train)

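# Build the network layer by layer: a recurrent layer with 3 memory units plus a fully connected layer.
# Embedding(5, 2): the vocabulary size is 5 and each letter is encoded as 2 numbers (encoding dimension 2).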
model = tf.keras.Sequential([
    Embedding(5, 2),
    SimpleRNN(3),
    Dense(5, activation='softmax')
])

# Configure training
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.01),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['sparse_categorical_accuracy']
)

# Path where the model weights are saved
checkpoint_save_path = "./checkpoint/rnn_embedding_1pre1.ckpt"

# Check whether saved weights already exist
if os.path.exists(checkpoint_save_path + '.index'):
    print('------------------load the model-------------------')
    # Load the saved weights
    model.load_weights(checkpoint_save_path)

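# Save the weights with the ModelCheckpoint callback provided by TensorFlow
# (save_weights_only=True stores the parameters only, not the network structure).
# monitor combined with save_best_only keeps the best model, e.g. the one with the
# lowest training loss, lowest test loss, highest training accuracy, or highest test accuracy.
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_save_path,
    save_weights_only=True,
    save_best_only=True,
    monitor='loss'  # fit is given no test set, so no test accuracy is computed; keep the best model by training loss
)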

# Run training
history = model.fit(x_train, y_train, batch_size=32, epochs=100, callbacks=[cp_callback])

# Print a summary of the network structure and parameter counts
model.summary()

# Extract the parameters and write them to weights.txt
# model.trainable_variables returns the trainable parameters of the model
with open('./weights.txt', 'w') as file:
    for v in model.trainable_variables:
        file.write(str(v.name) + '\n')
        file.write(str(v.shape) + '\n')
        file.write(str(v.numpy()) + '\n')

###############################################    show   ###############################################
# Plot the training accuracy and loss curves (no validation set is used here)
acc = history.history['sparse_categorical_accuracy']
loss = history.history['loss']

plt.subplot(1, 2, 1)
plt.plot(acc, label='Training Accuracy')
plt.title('Training Accuracy')  # plot title
plt.legend()  # legend

plt.subplot(1, 2, 2)
plt.plot(loss, label='Training Loss')
plt.title('Training Loss')  # plot title
plt.legend()  # legend
plt.show()

############### predict #############
preNum = int(input('input the number of test alphabet:'))
for i in range(preNum):
    alphabet1 = input("input test alphabet:")
    # Convert the letter into the input the model needs
    alphabet = [w_to_id[alphabet1]]
    # Reshape alphabet to the input shape Embedding expects: [number of samples, time steps].
    # One sample is fed here, so the number of samples is 1; one letter in gives one result, so the number of time steps is 1.
    alphabet = np.reshape(alphabet, (1, 1))
    result = model.predict(alphabet)
    pred = tf.argmax(result, axis=1)
    # pred has shape (1,); take its single element before converting to int
    pred = int(pred[0])
    tf.print(alphabet1 + '->' + input_word[pred])

Output: [figure omitted]

posted @ 2020-08-26 17:12 GumpYan
