Intelligent Classification of Reviews

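1. Background

Online shopping is now widespread, and people usually read a product's reviews before deciding what to buy. The volume of reviews is far too large to sort by hand, so computers are needed to classify them automatically.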

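2. Deep learning design

The data used in this project comes from open-source data released by Tencent.
The deep learning frameworks used are TensorFlow and Keras.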

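TensorFlow: TensorFlow™ is a symbolic math system based on dataflow programming. It is widely used to implement machine learning algorithms, and its predecessor is DistBelief, Google's neural network library [1].

TensorFlow has a multi-level architecture, can be deployed on servers, PCs, and web pages, and supports high-performance numerical computation on GPUs and TPUs. It is widely used in Google's internal product development and in scientific research across many fields [1-2].

TensorFlow is developed and maintained by Google Brain, Google's artificial intelligence team. It includes several projects, among them TensorFlow Hub, TensorFlow Lite, and TensorFlow Research Cloud, as well as a range of application programming interfaces (APIs) [2]. TensorFlow has been open source under the Apache 2.0 open source license since November 9, 2015 [2].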

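Keras: Keras is an open-source artificial neural network library written in Python. It can serve as a high-level application programming interface for TensorFlow, Microsoft-CNTK, and Theano, and is used to design, debug, evaluate, apply, and visualize deep learning models [1].

The Keras code base is object-oriented, fully modular, and extensible. Its runtime behavior and documentation are designed with user experience and ease of use in mind, and it tries to make complex algorithms easier to implement [1]. Keras supports the mainstream algorithms of modern artificial intelligence, including feed-forward and recurrent neural networks, and can be wrapped to take part in building statistical learning models [2]. On the hardware and development environment side, Keras supports multi-GPU parallel computation on multiple operating systems and, depending on the backend setting, is translated into components of TensorFlow, Microsoft-CNTK, or other systems [3].

The main developer of Keras is Google engineer François Chollet; its GitHub project page also lists 6 main maintainers and more than 800 direct contributors [4]. Since the release of its official version, Keras has been open source under the MIT license, except for some precompiled models [1].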

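NLP (natural language processing) techniques used in this project

The key to processing natural language is getting the computer to "understand" it, which is why natural language processing is also called natural language understanding (NLU) and is also known as computational linguistics. On the one hand it is a branch of language information processing; on the other it is one of the core topics of artificial intelligence (AI).

The difficulty is that a computer, unlike the human brain, cannot relate a word to its surrounding context while reading; on its own it does not understand text.

The approach taken here is to build a recurrent neural network, which loosely imitates how a person reads, so that the computer carries context forward from word to word when solving the problem.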
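
To make the idea of reading with context concrete, here is a minimal sketch of a recurrent step in plain Python. It is not the model used in this project: the weights, the two-dimensional hidden state, and the toy word vectors are made-up values chosen only for illustration. The point it shows is that the final hidden state depends on the order of the words, not only on which words appeared.

```python
import math

# Toy, hand-picked parameters (assumptions for illustration only).
W_IN = [[0.5, -0.3], [0.1, 0.8]]    # input -> hidden weights
W_REC = [[0.9, 0.2], [-0.4, 0.6]]   # hidden -> hidden weights


def rnn_step(hidden, x):
    """One recurrent step: new_hidden = tanh(W_IN @ x + W_REC @ hidden)."""
    new_hidden = []
    for i in range(len(hidden)):
        total = sum(W_IN[i][j] * x[j] for j in range(len(x)))
        total += sum(W_REC[i][j] * hidden[j] for j in range(len(hidden)))
        new_hidden.append(math.tanh(total))
    return new_hidden


def encode(sequence):
    """Feed the sequence through the recurrent step and return the final state."""
    hidden = [0.0, 0.0]
    for x in sequence:
        hidden = rnn_step(hidden, x)
    return hidden


# Two hypothetical "word vectors".
word_a = [1.0, 0.0]
word_b = [0.0, 1.0]

state_ab = encode([word_a, word_b])   # "a b"
state_ba = encode([word_b, word_a])   # "b a"
print(state_ab)
print(state_ba)
```

The two printed states differ even though both sequences contain the same words, which is the property a bag-of-words representation lacks.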

3. Implementation steps

Goal: have the computer classify reviews automatically

Data processing: segment the text into words with jieba
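
After segmentation, each review is a space-separated string of words, which still has to be turned into fixed-length integer sequences before a network can consume it. The sketch below imitates that step in plain Python: it ranks words by frequency, gives small integer ids to the most frequent ones, and pads or truncates every sequence on the left. The sample texts and the helpers `build_vocab`, `to_ids`, and `pad_left` are hypothetical; the project code relies on the Keras `Tokenizer` and `pad_sequences` utilities for this, and the sketch only approximates their default behavior.

```python
from collections import Counter


def build_vocab(segmented_texts, num_words):
    """Map the most frequent words to ids 1..num_words-1 (0 is kept for padding)."""
    counts = Counter(word for text in segmented_texts for word in text.split())
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def to_ids(text, vocab):
    """Convert one segmented text to ids, dropping words outside the vocabulary."""
    return [vocab[word] for word in text.split() if word in vocab]


def pad_left(ids, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros to exactly `maxlen`."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


# Hypothetical reviews, already segmented and joined with spaces
# ("quality very good", "quality very bad very disappointed").
texts = ["质量 很 好", "质量 很 差 很 失望"]

vocab = build_vocab(texts, num_words=4)
padded = [pad_left(to_ids(t, vocab), maxlen=4) for t in texts]
print(vocab)
print(padded)
```

Reserving id 0 for padding keeps padded positions distinguishable from real words.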

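Building the recurrent neural network model: an Embedding layer, an LSTM layer, and two Dense layers, with Dropout between them (see the code in section 5)

Training the model: after training, the model's accuracy can reach 85%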
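
For reference, an accuracy figure of this kind is the share of test reviews whose predicted class matches the labeled class. The sketch below shows that computation on made-up softmax outputs; the probabilities and labels are invented for illustration and are not results from this project. The class ids follow the mapping used in the code (0 neutral, 1 positive, 2 negative), and `accuracy` is a hypothetical helper.

```python
def accuracy(probabilities, labels):
    """Fraction of rows whose highest-probability class equals the true label."""
    correct = 0
    for row, label in zip(probabilities, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Invented softmax outputs for four reviews (columns: neutral, positive, negative).
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
true_labels = [0, 1, 2, 1]

acc = accuracy(probs, true_labels)
print("accuracy =", acc)
```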

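4. Summary

This project showed me how powerful deep learning is. It also made clear that a computer, however powerful, does not have feelings the way a person does; all it can do is associate words with the sentiment of the contexts in which they appear.
Improvements: the dataset is still too small and the available computing power too limited; given the chance, I would use cloud computing in the future.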

5. Code

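# Demonstrate jieba's three segmentation modes on one sample sentence.
import jieba

sentence = "杭州西湖风景很好,是旅游胜地!每年吸引很多游客!"

# Full mode: list every word the dictionary can find, including overlapping ones.
test1 = jieba.cut(sentence, cut_all=True)
print("Full mode: " + "| ".join(test1))

# Precise mode (the default): cut the sentence into the most likely word sequence.
test2 = jieba.cut(sentence, cut_all=False)
print("Precise mode: " + "| ".join(test2))

# Search engine mode: precise mode followed by a further split of long words.
test3 = jieba.cut_for_search(sentence)
print("Search engine mode: " + "| ".join(test3))
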
# Evaluation script: prepare the data, rebuild the model, load the saved weights,
# and report the accuracy on the test set.
# The Keras import paths below exist in older Keras releases; newer releases have
# moved or removed some of them.
import codecs

import jieba
import numpy as np
from keras.layers import Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils


def fencil(s):
    # Segment a sentence with jieba and join the words with spaces.
    cut = jieba.cut(s)
    text = " ".join(cut)
    return text


def readLabeledFile(path, commentstype):
    # Each line has the form "<label>||<review text>". Returns (texts, labels).
    content = []
    label = []
    with codecs.open(path, "r", "utf-8") as fp:
        for cur_text in fp:
            if not cur_text.strip():
                continue
            item = cur_text.rstrip("\r\n").split("||")
            label.append(commentstype[item[0]])
            content.append(fencil("".join(item[1:])))
    return content, label


# Load the raw data and split it into training set, test set, training labels, and test labels.
def makeTrainTestSets():
    # Labels used in the data files: 中立 = neutral, 正面 = positive, 负面 = negative.
    commentstype = {"中立": 0, "正面": 1, "负面": 2}

    train_content, train_label = readLabeledFile("sentiment.train.txt", commentstype)
    print("the number of train datasets is:" + str(len(train_content)))

    test_content, test_label = readLabeledFile("sentiment.test.txt", commentstype)
    print("the number of test datasets is:" + str(len(test_content)))

    # Save each array with numpy as a .npy file so later runs can skip segmentation.
    np.save("nlp_train_content.npy", train_content)
    np.save("nlp_train_label.npy", train_label)
    np.save("nlp_test_content.npy", test_content)
    np.save("nlp_test_label.npy", test_label)
    return (train_content, train_label), (test_content, test_label)


# Reuse the saved data sets if they exist; otherwise call makeTrainTestSets to build them.
try:
    train_content = np.load("nlp_train_content.npy")
    test_content = np.load("nlp_test_content.npy")
    train_label = np.load("nlp_train_label.npy")
    test_label = np.load("nlp_test_label.npy")
except OSError:
    (train_content, train_label), (test_content, test_label) = makeTrainTestSets()

print("***********success: segmentation finished, train and test data ready**********")
print(train_label[0])
print(train_content[0])
print(test_content[0])

# Build the vocabulary.
token = Tokenizer(num_words=3000)   # only the most frequent words (word indices below 3000) are kept
token.fit_on_texts(train_content)   # build the word index from the training texts
print(token.document_count)

# Preprocess the data: convert texts to integer sequences, then pad or truncate them.
trainSqe = token.texts_to_sequences(train_content)
testSqe = token.texts_to_sequences(test_content)
trainPadSqe = sequence.pad_sequences(trainSqe, maxlen=30)   # maxlen makes every sequence exactly 30 tokens long
testPadSqe = sequence.pad_sequences(testSqe, maxlen=30)

print(len(trainSqe))
print(len(testSqe))
print(trainPadSqe[0])
print(trainPadSqe.shape)

# One-hot encode the labels (three classes).
train_label_ohe = np_utils.to_categorical(train_label, num_classes=3)
test_label_ohe = np_utils.to_categorical(test_label, num_classes=3)
print(test_label_ohe[0])

# Build the model. This project uses a recurrent neural network (RNN).
model = Sequential()
model.add(Embedding(output_dim=50, input_dim=3000, input_length=30))   # output_dim maps each word index to a 50-dimensional vector
model.add(Dropout(0.25))
model.add(LSTM(64))                                 # long short-term memory (LSTM) is a special kind of RNN that can learn long-term dependencies
model.add(Dropout(0.3))                             # randomly drop 30% of the units during training
model.add(Dense(units=3, activation='relu'))        # relu is an activation function
model.add(Dropout(0.4))
model.add(Dense(units=3, activation='softmax'))     # one output probability per class
model.summary()

# Load the weights file.
try:
    model.load_weights("./sentimenttype.h5")
    print("Ready!")
except Exception:
    print("Not ready yet")

model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
results = model.evaluate(testPadSqe, test_label_ohe)
print('accuracy=', results[1])
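
As a sanity check on the architecture defined above, the number of trainable parameters can be worked out by hand. The sketch below applies the usual formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and dense layers to the sizes used in the script. The helper names are hypothetical, and the LSTM formula assumes the default Keras LSTM configuration, so the figures should be compared against what `model.summary()` prints rather than taken as given.

```python
def embedding_params(vocab_size, output_dim):
    """One vector of length output_dim per vocabulary entry."""
    return vocab_size * output_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


# Layer sizes taken from the script: Embedding(3000 -> 50), LSTM(64), Dense(3), Dense(3).
layer_params = {
    "embedding": embedding_params(3000, 50),
    "lstm": lstm_params(50, 64),
    "dense_relu": dense_params(64, 3),
    "dense_softmax": dense_params(3, 3),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(name, count)
print("total", total_params)
```

Dropout layers have no trainable parameters, so they do not appear in the count. Most of the parameters sit in the embedding table, which is one reason the vocabulary size matters for a small dataset.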

# Training script: prepare the data, build the model, and train it in rounds,
# saving the weights whenever the test accuracy improves.
# The Keras import paths below exist in older Keras releases; newer releases have
# moved or removed some of them.
import codecs

import jieba
import numpy as np
import tensorflow as tf
from keras.layers import Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils


def fencil(s):
    # Segment a sentence with jieba and join the words with spaces.
    cut = jieba.cut(s)
    text = " ".join(cut)
    return text


def readLabeledFile(path, commentstype):
    # Each line has the form "<label>||<review text>". Returns (texts, labels).
    content = []
    label = []
    with codecs.open(path, "r", "utf-8") as fp:
        for cur_text in fp:
            if not cur_text.strip():
                continue
            item = cur_text.rstrip("\r\n").split("||")
            label.append(commentstype[item[0]])
            content.append(fencil("".join(item[1:])))
    return content, label


# Load the raw data and split it into training set, test set, training labels, and test labels.
def makeTrainTestSets():
    # Labels used in the data files: 中立 = neutral, 正面 = positive, 负面 = negative.
    commentstype = {"中立": 0, "正面": 1, "负面": 2}

    train_content, train_label = readLabeledFile("sentiment.train.txt", commentstype)
    print("the number of train datasets is:" + str(len(train_content)))

    test_content, test_label = readLabeledFile("sentiment.test.txt", commentstype)
    print("the number of test datasets is:" + str(len(test_content)))

    # Save each array with numpy as a .npy file so later runs can skip segmentation.
    np.save("nlp_train_content.npy", train_content)
    np.save("nlp_train_label.npy", train_label)
    np.save("nlp_test_content.npy", test_content)
    np.save("nlp_test_label.npy", test_label)
    return (train_content, train_label), (test_content, test_label)


# Reuse the saved data sets if they exist; otherwise call makeTrainTestSets to build them.
try:
    train_content = np.load("nlp_train_content.npy")
    test_content = np.load("nlp_test_content.npy")
    train_label = np.load("nlp_train_label.npy")
    test_label = np.load("nlp_test_label.npy")
except OSError:
    (train_content, train_label), (test_content, test_label) = makeTrainTestSets()

print("***********success: segmentation finished, train and test data ready**********")
print(train_label[0])
print(train_content[0])
print(test_content[0])

# Build the vocabulary.
token = Tokenizer(num_words=3000)   # only the most frequent words (word indices below 3000) are kept
token.fit_on_texts(train_content)   # build the word index from the training texts
print(token.document_count)

# Preprocess the data: convert texts to integer sequences, then pad or truncate them.
trainSqe = token.texts_to_sequences(train_content)
testSqe = token.texts_to_sequences(test_content)
trainPadSqe = sequence.pad_sequences(trainSqe, maxlen=30)   # maxlen makes every sequence exactly 30 tokens long
testPadSqe = sequence.pad_sequences(testSqe, maxlen=30)

print(len(trainSqe))
print(len(testSqe))
print(trainPadSqe[0])
print(trainPadSqe.shape)

# One-hot encode the labels (three classes).
train_label_ohe = np_utils.to_categorical(train_label, num_classes=3)
test_label_ohe = np_utils.to_categorical(test_label, num_classes=3)
print(test_label_ohe[0])

# Build the model. This project uses a recurrent neural network (RNN).
model = Sequential()
model.add(Embedding(output_dim=50, input_dim=3000, input_length=30))   # output_dim maps each word index to a 50-dimensional vector
model.add(Dropout(0.25))
model.add(LSTM(64))                                 # long short-term memory (LSTM) is a special kind of RNN that can learn long-term dependencies
model.add(Dropout(0.3))                             # randomly drop 30% of the units during training
model.add(Dense(units=3, activation='relu'))        # relu is an activation function
model.add(Dropout(0.4))
model.add(Dense(units=3, activation='softmax'))     # one output probability per class
model.summary()

# Load the weights file, if one exists, so that training resumes from it.
try:
    model.load_weights("sentimenttype.h5")
    print("Ready!")
except Exception:
    print("Not ready yet")

# Disabled learning-rate schedule. Enabling it also requires
# "import keras.backend as K" and "from keras.callbacks import LearningRateScheduler".
'''
def scheduler(epoch):
    if epoch % 100 == 0 and epoch != 0:
        lr = K.get_value(model.optimizer.lr)
        K.set_value(model.optimizer.lr, lr * 0.01)
        print("lr changed to {}".format(lr * 0.01))
    return K.get_value(model.optimizer.lr)

reduce_lr = LearningRateScheduler(scheduler)
'''

# Train the model and tune its parameters.
# Compile once, outside the loop, so the optimizer state carries over between rounds.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.00001),
              loss="categorical_crossentropy",
              metrics=['accuracy'])

# Test accuracy of the current weights: the baseline that a new round has to match or beat.
results = model.evaluate(testPadSqe, test_label_ohe)
x = results[1]

for i in range(10000):
    train_history = model.fit(x=trainPadSqe,
                              y=train_label_ohe,
                              validation_split=0.2,
                              epochs=10,
                              batch_size=128,
                              verbose=1)
    results = model.evaluate(testPadSqe, test_label_ohe)
    print('accuracy=', results[1])
    # Save the weights whenever the test accuracy matches or beats the best so far.
    if results[1] >= x:
        x = results[1]
        model.save_weights("sentimenttype.h5")
        print("Save the newly trained model")
    # Stop once the target accuracy of 85% is reached.
    if results[1] >= 0.85:
        break

# Disabled: check the accuracy of the model.
'''
model.compile(loss="categorical_crossentropy",optimizer='adam',metrics=['accuracy'])
results=model.evaluate(testPadSqe, test_label_ohe)
'''
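
One way to express the intended control flow of the training loop (train a round, evaluate, keep the best weights, stop once the target is reached) is to separate it from Keras and check it on its own. The sketch below replays a made-up sequence of test accuracies through that logic. `run_rounds` and the accuracy values are hypothetical, and saving is represented by recording the round number instead of writing a weights file.

```python
def run_rounds(accuracies, initial_best, target=0.85):
    """Return (rounds at which weights would be saved, best accuracy, rounds run)."""
    best = initial_best
    saved_at = []
    rounds_run = 0
    for round_no, acc in enumerate(accuracies, start=1):
        rounds_run = round_no
        if acc >= best:        # matches or beats the best so far: checkpoint
            best = acc
            saved_at.append(round_no)
        if acc >= target:      # target reached: stop training
            break
    return saved_at, best, rounds_run


# Invented per-round test accuracies, for illustration only.
history = [0.62, 0.70, 0.68, 0.81, 0.86, 0.90]

saved_at, best, rounds_run = run_rounds(history, initial_best=0.60)
print(saved_at, best, rounds_run)
```

Keeping the "save" and "stop" checks as two independent conditions means a round that both improves on the best score and reaches the target is saved and also ends the loop, and a round that reaches the target without improving never overwrites better weights.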