【506】NLP in Practice Series (3): Loading and Preprocessing the IMDB Dataset with keras

　　We use the IMDB data for sentiment analysis.

　　The dataset is downloaded through keras.datasets. Pay attention to the structure of what is returned, then preprocess it.

from keras.datasets import imdb
from keras import preprocessing
 
# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words 
# (among top max_features most common words)
maxlen = 20
 
# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

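　　x_train

type: numpy.ndarray
shape: (25000,); each element is a list of integers, and the reviews differ in length, so they must be padded with 0 or truncated to a common length
every entry is an integer, and each integer corresponds to a word

　　y_train: binary labels, 0 (negative) or 1 (positive)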
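　　To make the integer encoding concrete, here is a small sketch of how a review becomes a list of integers and back. It assumes the default arguments of imdb.load_data (index 0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and word ranks shifted by 3). The vocabulary toy_word_index and the helpers encode and decode are hypothetical and only for illustration; the real mapping comes from imdb.get_word_index().

```python
# Reserved indices, assuming the default arguments of imdb.load_data.
PAD, START, OOV = 0, 1, 2
INDEX_FROM = 3  # word ranks are shifted by this offset

# Hypothetical toy vocabulary: word -> frequency rank (1 = most common).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}


def encode(words, word_index, num_words):
    """Encode words as a START marker followed by shifted frequency ranks."""
    ids = [START]
    for word in words:
        rank = word_index.get(word)
        idx = rank + INDEX_FROM if rank is not None else OOV
        # Indices at or above num_words are replaced by the OOV marker.
        ids.append(idx if idx < num_words else OOV)
    return ids


def decode(ids, word_index):
    """Map integers back to words, showing the reserved indices explicitly."""
    reverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    special = {PAD: "<pad>", START: "<start>", OOV: "<unk>"}
    return " ".join(
        special[i] if i in special else reverse.get(i, "<unk>") for i in ids
    )


ids = encode(["the", "movie", "was", "great"], toy_word_index, num_words=7)
print(ids)
print(decode(ids, toy_word_index))
```

　　With num_words=7 the word "great" (shifted index 7) falls outside the kept vocabulary and is replaced by the out-of-vocabulary marker, which is the same effect num_words=max_features has on rare words in the real data.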

　　The sequences then need to be brought to a common length.

# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

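　　The length is set to maxlen=20. With the default settings, longer reviews keep only their last 20 tokens and shorter ones are padded with 0 at the front.

　　The resulting matrix can be used directly as input to an Embedding layer.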
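　　As a rough illustration of why the padded matrix suits an Embedding layer, the lookup such a layer performs can be sketched in plain NumPy: each integer selects one row of a weight table. This is not the keras implementation. The embedding size of 8, the random table, and the random stand-in batch are assumptions made only for this example.

```python
import numpy as np

max_features = 10000  # vocabulary size, as in the loading code
maxlen = 20           # padded sequence length
embedding_dim = 8     # hypothetical embedding size, chosen for illustration

rng = np.random.default_rng(0)

# Stand-in for the layer's weights: one row per word index.
embedding_table = rng.normal(size=(max_features, embedding_dim))

# Stand-in for a small batch of padded reviews, shape (samples, maxlen).
x_batch = rng.integers(0, max_features, size=(4, maxlen))

# Integer-array indexing looks up one table row per token.
embedded = embedding_table[x_batch]
print(embedded.shape)  # (4, 20, 8)
```

　　The lookup only works because every row of x_batch has the same length and every value is a valid row index below max_features, which is what num_words and pad_sequences guarantee.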

Reference: padding sequences with pad_sequences

Syntax:

`keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)`

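　　Converts a list of nb_samples sequences (each a sequence of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps=maxlen; otherwise it is the length of the longest sequence. Shorter sequences are padded with value (0 by default) until they reach that length, and sequences longer than nb_timesteps are truncated to fit. Where padding and truncation take place is controlled by padding and truncating respectively; both default to 'pre', that is, the start of the sequence.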

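Parameters:

sequences: a two-level nested list of floats or integers
maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated; shorter ones are padded with value (0 by default) at the position given by padding
dtype: data type of the returned numpy array
padding: 'pre' or 'post'; whether to pad at the start or the end of a sequence
truncating: 'pre' or 'post'; whether to truncate from the start or the end of a sequence
value: float; the value used for padding in place of the default 0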

Return value:

　　A 2D numpy array of shape (nb_samples, nb_timesteps)
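　　The padding and truncation rules above can be sketched in pure Python. The function pad_sequences_sketch is a hypothetical helper written for this post, not the keras implementation: it returns nested lists rather than a numpy array and ignores dtype, but it follows the same 'pre'/'post' logic.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pure-Python illustration of the pad/truncate rules (not keras itself)."""
    if maxlen is None:
        # Without maxlen, every row is padded to the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                # Drop items from the start, keeping the last maxlen items.
                seq = seq[len(seq) - maxlen:]
            else:
                # Drop items from the end, keeping the first maxlen items.
                seq = seq[:maxlen]
        filler = [value] * (maxlen - len(seq))
        # 'pre' puts the filler before the data, 'post' puts it after.
        result.append(filler + seq if padding == "pre" else seq + filler)
    return result


a = [[2, 3], [3, 4, 6], [7, 8, 9, 10]]
print(pad_sequences_sketch(a, maxlen=3))
print(pad_sequences_sketch(a, maxlen=3, padding="post", truncating="post"))
```

　　Truncation is applied before padding, so a sequence is never both cut and filled in the same call.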

Example:

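>>> import numpy as np
>>> # dtype=object is needed for ragged rows on recent NumPy versions
>>> a = np.array([[2, 3],
...               [3, 4, 6],
...               [7, 8, 9, 10]], dtype=object)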
>>> a
array([list([2, 3]), list([3, 4, 6]), list([7, 8, 9, 10])], dtype=object)
>>> import keras
Using TensorFlow backend.
>>> b = keras.preprocessing.sequence.pad_sequences(a, maxlen=10)
>>> b
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  2,  3],
       [ 0,  0,  0,  0,  0,  0,  0,  3,  4,  6],
       [ 0,  0,  0,  0,  0,  0,  7,  8,  9, 10]])
>>> c = keras.preprocessing.sequence.pad_sequences(a, maxlen=10, padding='post')
>>> c
array([[ 2,  3,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  4,  6,  0,  0,  0,  0,  0,  0,  0],
       [ 7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])
>>> d = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post')
>>> d
array([[ 2,  3,  0],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> e = keras.preprocessing.sequence.pad_sequences(a, maxlen=3)
>>> e
array([[ 0,  2,  3],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> f = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post', truncating='post')
>>> f
array([[2, 3, 0],
       [3, 4, 6],
       [7, 8, 9]])

Posted on 2020-12-27 12:33 by McDelfino
