
Notes on "Deep Learning with Python" --- 3.4-2 (a) Classifying Movie Reviews: A Binary Classification Problem


I. Summary

One-sentence summary:

Input data transformation: for this binary classification of movie reviews, pay attention to how the input data is transformed before it reaches the model:
Each review is encoded in a one-hot-like (multi-hot) way: the positions corresponding to the word indices that appear in the review are set to 1.
So if a review's word indices are 1, 5, and 7, then positions 1, 5, and 7 of the 10,000-dimensional vector are 1 and all other positions are 0 (see the sketch below).
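For a concrete picture of that encoding, a minimal illustrative sketch (not from the notebook itself):

import numpy as np
vec = np.zeros(10000)
vec[[1, 5, 7]] = 1.   # the review contains word indices 1, 5, and 7
print(vec[:10])       # [0. 1. 0. 0. 0. 1. 0. 1. 0. 0.]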

 

 

1. How to handle the output of a binary classifier?

Option 1, one-hot encoding: with one-hot labels the model predicts which of the two classes it is, so the output layer has 2 neurons.
Option 2, a direct sigmoid output: a single output neuron; the prediction is positive when the output is >= 0.5. (A sketch of both options follows.)
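A minimal sketch of the two output-layer options (illustrative variable names, not from the notebook):

import tensorflow as tf
# option 1: one-hot labels, 2 output neurons, softmax picks the class
out_two = tf.keras.layers.Dense(2, activation='softmax')
# option 2: scalar 0/1 labels, 1 sigmoid neuron, threshold predictions at 0.5
out_one = tf.keras.layers.Dense(1, activation='sigmoid')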

 

 

2. Keep only the 10,000 most frequently occurring words in the training data?

num_words=10000: (train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=10000)

The argument num_words=10000 means that only the 10,000 most frequently occurring words in the training data are kept; lower-frequency words are discarded.

 

3. How to find the largest word index across all reviews?

Take the maximum of each sequence, then the maximum of those: max([max(sequence) for sequence in train_data])

 

 

4. How to set the dtype of a NumPy n-dimensional array?

Use the astype method:
train_y = np.array(train_y).astype('float32')
test_y = np.array(test_y).astype('float32')

 

 

II. (a) Classifying Movie Reviews: A Binary Classification Problem

Video location in the corresponding course:

 

Changing the labels from int to float seems to make no practical difference.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

Steps

1. Load the dataset
2. Split the dataset (into a training set and a test set)
3. Build the model
4. Train the model
5. Evaluate the model

Task

1. Load the dataset

In [2]:
(train_x, train_y),(test_x, test_y)=tf.keras.datasets.imdb.load_data(num_words=10000)
In [3]:
print(type(train_x))
<class 'numpy.ndarray'>
In [4]:
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)
(25000,)
(25000,)
(25000,)
(25000,)
In [6]:
print(train_x[0:2])
print(train_y[0:2])
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])
 list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])]
[1 0]
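As an aside, the index sequences can be decoded back into text. A minimal sketch (not in the original notebook) using tf.keras.datasets.imdb.get_word_index(); the data's indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown tokens:

# decode the first review into words (run before train_x is vectorized below)
word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
decoded = ' '.join(reverse_word_index.get(i - 3, '?') for i in train_x[0])
print(decoded[:100])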

2. Format the dataset

Each review in the dataset is encoded in a one-hot-like (multi-hot) way:

the position for every word index that appears in the review is set to 1;
positions for words that do not appear stay 0.
So if a review's word indices are 1, 5, and 7, positions 1, 5, and 7 of the 10,000-dimensional vector are 1 and everything else is 0.

Processing the x data

In [7]:
print(type(train_x))
<class 'numpy.ndarray'>
In [8]:
print(len(train_x))
25000
In [9]:
def format_data(sequences, dimension=10000):
    # one multi-hot row per review
    ans = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # fancy indexing: set all of this review's word positions to 1 at once
        ans[i, sequence] = 1.
    return ans
In [10]:
train_x=format_data(train_x)
test_x=format_data(test_x)
print(train_x)
print(train_x.shape)
print(test_x.shape)
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]
(25000, 10000)
(25000, 10000)
In [11]:
print(train_x[0])
[0. 1. 1. ... 0. 0. 0.]
In [12]:
# print([ val for i,val in enumerate(train_x[0])])
In [13]:
ans1 = []
for i, val in enumerate(train_x[0]):
    if int(val) == 1:
        ans1.append(i)
print(ans1)
[1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 25, 26, 28, 30, 32, 33, 35, 36, 38, 39, 43, 46, 48, 50, 51, 52, 56, 62, 65, 66, 71, 76, 77, 82, 87, 88, 92, 98, 100, 103, 104, 106, 107, 112, 113, 117, 124, 130, 134, 135, 141, 144, 147, 150, 167, 172, 173, 178, 192, 194, 215, 224, 226, 256, 283, 284, 297, 316, 317, 336, 381, 385, 386, 400, 407, 447, 458, 469, 476, 480, 515, 530, 546, 619, 626, 670, 723, 838, 973, 1029, 1111, 1247, 1334, 1385, 1415, 1622, 1920, 2025, 2071, 2223, 3766, 3785, 3941, 4468, 4472, 4536, 4613, 5244, 5345, 5535, 5952, 7486]
In [14]:
ans1 = []
for i, val in enumerate(train_x[1]):
    if int(val) == 1:
        ans1.append(i)
print(ans1)
[1, 2, 4, 5, 6, 7, 8, 9, 11, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 26, 28, 29, 30, 32, 33, 35, 37, 38, 43, 45, 46, 50, 52, 64, 68, 69, 78, 89, 93, 95, 98, 102, 110, 114, 116, 118, 119, 120, 123, 125, 126, 130, 131, 134, 136, 145, 148, 152, 154, 163, 165, 168, 174, 175, 188, 189, 194, 207, 220, 228, 229, 245, 249, 285, 299, 340, 349, 371, 394, 398, 455, 462, 491, 605, 625, 647, 656, 715, 775, 952, 954, 1002, 1153, 1157, 1212, 1322, 1355, 1382, 1463, 1523, 1543, 1634, 1649, 1690, 1905, 2300, 2350, 2637, 3103, 3215, 4362, 4369, 4373, 4901, 5012, 6853, 7464, 8003, 8163, 8255, 9837]
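The same positions can be recovered without a manual loop; a minimal NumPy sketch:

# np.nonzero returns the indices of all nonzero entries in one call
print(np.nonzero(train_x[1])[0].tolist())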

Processing the y (label) data

In [15]:
train_y=np.array(train_y).astype('float32')
test_y=np.array(test_y).astype('float32')
In [16]:
print(type(train_y))
print(type(test_y))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
In [17]:
print(train_y.dtype)
float32
In [18]:
print(train_y[0:12])
[1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0.]

3. Build the model

In [19]:
# create the Sequential container
model = tf.keras.Sequential()
# input layer
model.add(tf.keras.Input(shape=(10000,)))
# hidden layer
model.add(tf.keras.layers.Dense(16,activation='relu'))
# hidden layer
model.add(tf.keras.layers.Dense(16,activation='relu'))
# output layer: one sigmoid neuron for binary classification
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))
# print the model architecture
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                160016    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
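The parameter counts in the summary are easy to verify: a Dense layer has input_dim × units weights plus units biases. A quick sketch:

# weights + biases per Dense layer
assert 10000 * 16 + 16 == 160016   # dense
assert 16 * 16 + 16 == 272         # dense_1
assert 16 * 1 + 1 == 17            # dense_2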

4. Train the model

In [20]:
# configure the optimizer, loss function, and metrics
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
# start training (the test set doubles here as the validation set)
history = model.fit(train_x,train_y,epochs=20,batch_size=512,validation_data=(test_x,test_y))
Epoch 1/20
49/49 [==============================] - 2s 47ms/step - loss: 0.4866 - acc: 0.8104 - val_loss: 0.3425 - val_acc: 0.8754
Epoch 2/20
49/49 [==============================] - 1s 26ms/step - loss: 0.2527 - acc: 0.9103 - val_loss: 0.2868 - val_acc: 0.8844
Epoch 3/20
49/49 [==============================] - 1s 28ms/step - loss: 0.1853 - acc: 0.9361 - val_loss: 0.2878 - val_acc: 0.8851
Epoch 4/20
49/49 [==============================] - 1s 27ms/step - loss: 0.1480 - acc: 0.9492 - val_loss: 0.3128 - val_acc: 0.8780
Epoch 5/20
49/49 [==============================] - 1s 27ms/step - loss: 0.1223 - acc: 0.9596 - val_loss: 0.3330 - val_acc: 0.8750
Epoch 6/20
49/49 [==============================] - 1s 26ms/step - loss: 0.1004 - acc: 0.9683 - val_loss: 0.3687 - val_acc: 0.8698
Epoch 7/20
49/49 [==============================] - 1s 27ms/step - loss: 0.0831 - acc: 0.9760 - val_loss: 0.4025 - val_acc: 0.8680
Epoch 8/20
49/49 [==============================] - 1s 28ms/step - loss: 0.0686 - acc: 0.9808 - val_loss: 0.4408 - val_acc: 0.8637
Epoch 9/20
49/49 [==============================] - 1s 30ms/step - loss: 0.0556 - acc: 0.9872 - val_loss: 0.4837 - val_acc: 0.8613
Epoch 10/20
49/49 [==============================] - 1s 29ms/step - loss: 0.0446 - acc: 0.9907 - val_loss: 0.5269 - val_acc: 0.8594
Epoch 11/20
49/49 [==============================] - 1s 27ms/step - loss: 0.0349 - acc: 0.9941 - val_loss: 0.5752 - val_acc: 0.8570
Epoch 12/20
49/49 [==============================] - 1s 26ms/step - loss: 0.0273 - acc: 0.9962 - val_loss: 0.6120 - val_acc: 0.8558
Epoch 13/20
49/49 [==============================] - 1s 26ms/step - loss: 0.0213 - acc: 0.9975 - val_loss: 0.6581 - val_acc: 0.8544
Epoch 14/20
49/49 [==============================] - 1s 27ms/step - loss: 0.0163 - acc: 0.9986 - val_loss: 0.6933 - val_acc: 0.8534
Epoch 15/20
49/49 [==============================] - 1s 30ms/step - loss: 0.0129 - acc: 0.9991 - val_loss: 0.7282 - val_acc: 0.8531
Epoch 16/20
49/49 [==============================] - 1s 27ms/step - loss: 0.0102 - acc: 0.9995 - val_loss: 0.7620 - val_acc: 0.8522
Epoch 17/20
49/49 [==============================] - 1s 29ms/step - loss: 0.0083 - acc: 0.9997 - val_loss: 0.8018 - val_acc: 0.8516
Epoch 18/20
49/49 [==============================] - 1s 27ms/step - loss: 0.0066 - acc: 0.9998 - val_loss: 0.8258 - val_acc: 0.8516
Epoch 19/20
49/49 [==============================] - 1s 28ms/step - loss: 0.0055 - acc: 0.9998 - val_loss: 0.8494 - val_acc: 0.8514
Epoch 20/20
49/49 [==============================] - 1s 26ms/step - loss: 0.0045 - acc: 1.0000 - val_loss: 0.8866 - val_acc: 0.8507

Visualizing loss and accuracy

In [21]:
plt.plot(history.epoch,history.history.get('loss'),'b--',label='train_loss')
plt.plot(history.epoch,history.history.get('val_loss'),'r-',label='test_loss')
plt.title("loss")
plt.legend()
plt.show()
In [22]:
plt.plot(history.epoch,history.history.get('acc'),'b--',label='train_acc')
plt.plot(history.epoch,history.history.get('val_acc'),'r-',label='test_acc')
plt.title("acc")
plt.legend()
plt.show()
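The curves (and the log above) show val_loss bottoming out around epoch 2-3 and rising afterward while training loss keeps falling, i.e. the model overfits. One common remedy is early stopping; a minimal sketch, not in the original notebook:

# stop when validation loss stops improving and restore the best weights seen
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(train_x, train_y, epochs=20, batch_size=512,
                    validation_data=(test_x, test_y),
                    callbacks=[early_stop])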

5. Evaluate the model

In [23]:
print(model.predict(train_x))
print(train_y)
[[9.9999452e-01]
 [1.0614559e-06]
 [5.5777605e-09]
 ...
 [2.3447652e-04]
 [9.9998438e-01]
 [6.0824757e-03]]
[1. 0. 0. ... 0. 1. 0.]
In [24]:
predict_test_y=model.predict(test_x)
print(predict_test_y)
print(test_y)
[[0.0043372 ]
 [0.9999999 ]
 [0.9835339 ]
 ...
 [0.00850678]
 [0.00191248]
 [0.9965436 ]]
[0. 1. 1. ... 0. 0. 0.]
In [25]:
predict_test_y1 = []
for i in predict_test_y:
    predict_test_y1.append(1 if i >= 0.5 else 0)
print([int(i) for i in test_y][0:20])
print(predict_test_y1[0:20])
[0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
[0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
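To reduce the comparison above to a single number, a minimal sketch computing overall test accuracy from the predictions:

# fraction of thresholded predictions that match the true labels
test_acc = np.mean((predict_test_y.ravel() >= 0.5) == test_y)
print(test_acc)   # should be close to the final val_acc above (~0.85)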