AI Is No Big Deal: Building Your Own Chatbot (ChatRobot) with the Python 3 Deep Learning Libraries Keras/TensorFlow

Reposted from 刘悦的技术博客 (Liu Yue's tech blog): https://v3u.cn/a_id_178

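The idea of a chatbot (ChatRobot) is familiar to all of us. Maybe you have flirted with Siri out of sheer boredom, or bantered with 小爱同学 (XiaoAI) in an idle moment; either way, we have to admit that artificial intelligence has become part of everyday life. There is no shortage of bots on the market that offer third-party APIs: 微软小冰 (Microsoft XiaoIce), 图灵机器人 (Turing Robot), 腾讯闲聊 (Tencent chat), 青云客 (Qingyunke) and so on, and any of them can be wired into an app or a web application whenever we like. But how are these services implemented underneath? And without a network connection, could a robot talk to people using nothing but a locally stored "mind sphere", as depicted in the American TV series Westworld? If being a mere "package caller" (调包侠) does not satisfy you, come along: this time we will use the deep learning libraries Keras/TensorFlow to build a local chatbot of our own that depends on no third-party API and no network.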

First, install the dependencies:

pip3 install Tensorflow  
pip3 install Keras  
pip3 install nltk

Then create the script test_bot.py and import the libraries we need:

import nltk  
import ssl  
from nltk.stem.lancaster import LancasterStemmer  
stemmer = LancasterStemmer()  
  
import numpy as np  
from keras.models import Sequential  
from keras.layers import Dense, Activation, Dropout  
from keras.optimizers import SGD  
import pandas as pd  
import pickle  
import random

There is one pitfall here: the natural language toolkit NLTK raises an error:



Resource punkt not found

Normally, calling the downloader is enough:

import nltk  
nltk.download('punkt')

However, because of network restrictions the Python downloader often cannot fetch the resource, so we take a detour and download the archive by hand:

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip

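After unpacking, place it under nltk_data\tokenizers in your user directory, which is where NLTK looks for the tokenizers/punkt resource:

C:\Users\liuyue\nltk_data\tokenizers\punkt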
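The manual step can also be scripted. The sketch below is only an illustration that uses the standard library. It assumes the archive has a top-level punkt folder and that NLTK searches an nltk_data folder in your home directory; install_punkt is a hypothetical helper name, not part of NLTK.

```python
import zipfile
from pathlib import Path


def install_punkt(zip_path, nltk_data_dir=None):
    """Unpack a manually downloaded punkt.zip where NLTK expects it.

    Assumption: the archive contains a top-level ``punkt/`` folder, so
    extracting into ``<nltk_data>/tokenizers`` yields
    ``<nltk_data>/tokenizers/punkt``.
    """
    # default to ~/nltk_data, one of the locations NLTK searches
    base = Path(nltk_data_dir) if nltk_data_dir else Path.home() / "nltk_data"
    target = base / "tokenizers"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target / "punkt"
```

After calling install_punkt("punkt.zip") with the path of your download, nltk.data.find("tokenizers/punkt") should no longer raise a LookupError.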

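OK, back to the main topic. The main challenge in building a chatbot is classifying what the user types, that is, recognizing the person's actual intent (classical machine learning can do this, but it is more involved; I took the lazy route and used deep learning with Keras). The second challenge is keeping context, in other words analyzing and tracking the state of the conversation. In that situation we usually do not need to classify the user's intent at all; we simply treat the input as the answer to a question the chatbot asked. Here we use the Keras deep learning library to build the classification model.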

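The chatbot's intents and the patterns it has to learn are defined in one simple variable. There is no need for a corpus running to terabytes. Bot builders without a corpus may get laughed at, but our goal is only to build one specific chatbot for one specific context. The classification model is therefore built on a small vocabulary, and it can recognize only the small set of patterns supplied for training.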

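Put plainly, machine learning means repeatedly teaching a machine to do one or a few things correctly. During training you keep demonstrating what is right and hope the machine learns to generalize. This time, though, we do not teach it many things, just one, to test how it reacts. It is a bit like training your dog at home, except that the dog cannot chat with you.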

The intents variable below is just a simple example; if you like, you can extend it as far as you want with a corpus:

intents = {"intents": [  
        {"tag": "打招呼",  
         "patterns": ["你好", "您好", "请问", "有人吗", "师傅","不好意思","美女","帅哥","靓妹","hi"],  
         "responses": ["您好", "又是您啊", "吃了么您内","您有事吗"],  
         "context": [""]  
        },  
        {"tag": "告别",  
         "patterns": ["再见", "拜拜", "88", "回见", "回头见"],  
         "responses": ["再见", "一路顺风", "下次见", "拜拜了您内"],  
         "context": [""]  
        },  
   ]  
}

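As you can see, I defined two context tags, 打招呼 (greeting) and 告别 (farewell), each with the user inputs (patterns) and the bot's replies (responses).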
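To make extending the variable less error-prone, a small helper can append entries with the same keys. This is only a sketch: add_intent is a hypothetical name, and the 感谢 (thanks) tag with its phrases is invented purely for illustration.

```python
def add_intent(intents, tag, patterns, responses, context=None):
    """Append one intent using the same keys as the entries shown above."""
    if any(item["tag"] == tag for item in intents["intents"]):
        raise ValueError(f"tag already exists: {tag}")
    if not patterns or not responses:
        raise ValueError("an intent needs at least one pattern and one response")
    intents["intents"].append({
        "tag": tag,
        "patterns": list(patterns),
        "responses": list(responses),
        "context": list(context) if context else [""],
    })
    return intents


# illustrative data only; this tag and its phrases are not from the article
demo = {"intents": []}
add_intent(demo, "感谢", ["谢谢", "多谢"], ["不客气", "应该的"])
print([item["tag"] for item in demo["intents"]])
```

Keep in mind that every new tag changes the number of output classes, so the model has to be retrained after the variable is extended.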

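Before training the classification model we need to build the vocabulary. The patterns are processed to build it: each word is stemmed to a common root, which helps match more variations of user input.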

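words = []
classes = []
documents = []
# assumption: the article does not show this list; '?' is only a placeholder
ignore_words = ['?']

for intent in intents['intents']: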
    for pattern in intent['patterns']:  
        # tokenize each word in the sentence  
        w = nltk.word_tokenize(pattern)  
        # add to our words list  
        words.extend(w)  
        # add to documents in our corpus  
        documents.append((w, intent['tag']))  
        # add to our classes list  
        if intent['tag'] not in classes:  
            classes.append(intent['tag'])  
  
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]  
words = sorted(list(set(words)))  
  
classes = sorted(list(set(classes)))  
  
print (len(classes), "语境", classes)  
  
print (len(words), "词数", words)

Output (this recorded run lists 14 words and does not include "hi"):

2 语境 ['告别', '打招呼']  
14 词数 ['88', '不好意思', '你好', '再见', '回头见', '回见', '帅哥', '师傅', '您好', '拜拜', '有人吗', '美女', '请问', '靓妹']

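Training cannot work on the words directly, because words as such mean nothing to a machine. This is also a trap many Chinese word-segmentation libraries fall into: the machine has no idea whether the input is English or Chinese. All we need to do is turn each English or Chinese word into a bag of words, an array of 0/1 values. The array is as long as the vocabulary, and a position is set to 1 when the word at that position occurs in the current pattern.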

# create our training data  
training = []  
# create an empty array for our output  
output_empty = [0] * len(classes)  
# training set, bag of words for each sentence  
for doc in documents:  
    # initialize our bag of words  
    bag = []  
  
    pattern_words = doc[0]  
     
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]  
  
    for w in words:  
        bag.append(1 if w in pattern_words else 0)
      
   
    output_row = list(output_empty)  
    output_row[classes.index(doc[1])] = 1  
      
    training.append([bag, output_row])  
  
random.shuffle(training)  
# dtype=object keeps the [bag, output_row] pairs, whose lengths differ
training = np.array(training, dtype=object)
  
train_x = list(training[:,0])  
train_y = list(training[:,1])
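To see what a single training row looks like, here is a small worked example in plain Python. It is a sketch rather than part of the script: the vocabulary and class list are copied from the output printed earlier, and encode, demo_words and demo_classes are names made up for this illustration.

```python
# vocabulary and classes copied from the output printed earlier
demo_words = ['88', '不好意思', '你好', '再见', '回头见', '回见', '帅哥',
              '师傅', '您好', '拜拜', '有人吗', '美女', '请问', '靓妹']
demo_classes = ['告别', '打招呼']


def encode(tokens, tag):
    """Return (bag, one_hot) for one tokenized pattern and its tag."""
    bag = [1 if w in tokens else 0 for w in demo_words]
    one_hot = [1 if c == tag else 0 for c in demo_classes]
    return bag, one_hot


bag, label = encode(['再见'], '告别')
print(bag)    # a single 1 at index 3, the position of '再见'
print(label)  # [1, 0]
```

The bag is what goes into train_x and the one-hot label is what goes into train_y, which is why the first layer has as many inputs as there are words and the last layer as many outputs as there are classes.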

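Now we train on the data. The model is built with Keras and has three Dense layers. The data set is small, and the classification output is a multi-class array that identifies the encoded intent: softmax activation yields one probability per class, and the training targets are 0/1 arrays such as [1,0,0,…,0].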

model = Sequential()  
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))  
model.add(Dropout(0.5))  
model.add(Dense(64, activation='relu'))  
model.add(Dropout(0.5))  
model.add(Dense(len(train_y[0]), activation='softmax'))  
  
  
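# lr and decay are the argument names of older Keras releases;
# newer ones use learning_rate and have dropped decay
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)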
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])  
  
  
model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)

This runs training for 200 epochs with a batch size of 5. My sample data set is tiny, so 100 epochs would do as well; that is not the point here.
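For readers who want to see what the softmax output layer and the categorical_crossentropy loss compute, here is a minimal sketch in plain Python. It is not how Keras implements them internally, and the input scores are made up for the example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


probs = softmax([2.0, 0.0])  # made-up scores for ['告别', '打招呼']
loss = categorical_crossentropy([1, 0], probs)
print([round(p, 4) for p in probs], round(loss, 4))
```

The more probability the model puts on the correct class, the closer the loss gets to zero, which is the trend to look for in the training log.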

Training starts:

Epoch 1/200
14/14 [==============================] - 0s 32ms/step - loss: 0.7305 - acc: 0.5000
Epoch 2/200  
14/14 [==============================] - 0s 391us/step - loss: 0.7458 - acc: 0.4286  
Epoch 3/200  
14/14 [==============================] - 0s 390us/step - loss: 0.7086 - acc: 0.3571  
Epoch 4/200  
14/14 [==============================] - 0s 395us/step - loss: 0.6941 - acc: 0.6429  
Epoch 5/200  
14/14 [==============================] - 0s 426us/step - loss: 0.6358 - acc: 0.7143  
Epoch 6/200  
14/14 [==============================] - 0s 356us/step - loss: 0.6287 - acc: 0.5714  
Epoch 7/200  
14/14 [==============================] - 0s 366us/step - loss: 0.6457 - acc: 0.6429  
Epoch 8/200  
14/14 [==============================] - 0s 899us/step - loss: 0.6336 - acc: 0.6429  
Epoch 9/200  
14/14 [==============================] - 0s 464us/step - loss: 0.5815 - acc: 0.6429  
Epoch 10/200  
14/14 [==============================] - 0s 408us/step - loss: 0.5895 - acc: 0.6429  
Epoch 11/200  
14/14 [==============================] - 0s 548us/step - loss: 0.6050 - acc: 0.6429  
Epoch 12/200  
14/14 [==============================] - 0s 468us/step - loss: 0.6254 - acc: 0.6429  
Epoch 13/200  
14/14 [==============================] - 0s 388us/step - loss: 0.4990 - acc: 0.7857  
Epoch 14/200  
14/14 [==============================] - 0s 392us/step - loss: 0.5880 - acc: 0.7143  
Epoch 15/200  
14/14 [==============================] - 0s 370us/step - loss: 0.5118 - acc: 0.8571  
Epoch 16/200  
14/14 [==============================] - 0s 457us/step - loss: 0.5579 - acc: 0.7143  
Epoch 17/200  
14/14 [==============================] - 0s 432us/step - loss: 0.4535 - acc: 0.7857  
Epoch 18/200  
14/14 [==============================] - 0s 357us/step - loss: 0.4367 - acc: 0.7857  
Epoch 19/200  
14/14 [==============================] - 0s 384us/step - loss: 0.4751 - acc: 0.7857  
Epoch 20/200  
14/14 [==============================] - 0s 346us/step - loss: 0.4404 - acc: 0.9286  
Epoch 21/200  
14/14 [==============================] - 0s 500us/step - loss: 0.4325 - acc: 0.8571  
Epoch 22/200  
14/14 [==============================] - 0s 400us/step - loss: 0.4104 - acc: 0.9286  
Epoch 23/200  
14/14 [==============================] - 0s 738us/step - loss: 0.4296 - acc: 0.7857  
Epoch 24/200  
14/14 [==============================] - 0s 387us/step - loss: 0.3706 - acc: 0.9286  
Epoch 25/200  
14/14 [==============================] - 0s 430us/step - loss: 0.4213 - acc: 0.8571  
Epoch 26/200  
14/14 [==============================] - 0s 351us/step - loss: 0.2867 - acc: 1.0000  
Epoch 27/200  
14/14 [==============================] - 0s 3ms/step - loss: 0.2903 - acc: 1.0000  
Epoch 28/200  
14/14 [==============================] - 0s 366us/step - loss: 0.3010 - acc: 0.9286  
Epoch 29/200  
14/14 [==============================] - 0s 404us/step - loss: 0.2466 - acc: 0.9286  
Epoch 30/200  
14/14 [==============================] - 0s 428us/step - loss: 0.3035 - acc: 0.7857  
Epoch 31/200  
14/14 [==============================] - 0s 407us/step - loss: 0.2075 - acc: 1.0000  
Epoch 32/200  
14/14 [==============================] - 0s 457us/step - loss: 0.2167 - acc: 0.9286  
Epoch 33/200  
14/14 [==============================] - 0s 613us/step - loss: 0.1266 - acc: 1.0000  
Epoch 34/200  
14/14 [==============================] - 0s 534us/step - loss: 0.2906 - acc: 0.9286  
Epoch 35/200  
14/14 [==============================] - 0s 463us/step - loss: 0.2560 - acc: 0.9286  
Epoch 36/200  
14/14 [==============================] - 0s 500us/step - loss: 0.1686 - acc: 1.0000  
Epoch 37/200  
14/14 [==============================] - 0s 387us/step - loss: 0.0922 - acc: 1.0000  
Epoch 38/200  
14/14 [==============================] - 0s 430us/step - loss: 0.1620 - acc: 1.0000  
Epoch 39/200  
14/14 [==============================] - 0s 371us/step - loss: 0.1104 - acc: 1.0000  
Epoch 40/200  
14/14 [==============================] - 0s 488us/step - loss: 0.1330 - acc: 1.0000  
Epoch 41/200  
14/14 [==============================] - 0s 381us/step - loss: 0.1322 - acc: 1.0000  
Epoch 42/200  
14/14 [==============================] - 0s 462us/step - loss: 0.0575 - acc: 1.0000  
Epoch 43/200  
14/14 [==============================] - 0s 1ms/step - loss: 0.1137 - acc: 1.0000  
Epoch 44/200  
14/14 [==============================] - 0s 450us/step - loss: 0.0245 - acc: 1.0000  
Epoch 45/200  
14/14 [==============================] - 0s 470us/step - loss: 0.1824 - acc: 1.0000  
Epoch 46/200  
14/14 [==============================] - 0s 444us/step - loss: 0.0822 - acc: 1.0000  
Epoch 47/200  
14/14 [==============================] - 0s 436us/step - loss: 0.0939 - acc: 1.0000  
Epoch 48/200  
14/14 [==============================] - 0s 396us/step - loss: 0.0288 - acc: 1.0000  
Epoch 49/200  
14/14 [==============================] - 0s 580us/step - loss: 0.1367 - acc: 0.9286  
Epoch 50/200  
14/14 [==============================] - 0s 351us/step - loss: 0.0363 - acc: 1.0000  
Epoch 51/200  
14/14 [==============================] - 0s 379us/step - loss: 0.0272 - acc: 1.0000  
Epoch 52/200  
14/14 [==============================] - 0s 358us/step - loss: 0.0712 - acc: 1.0000  
Epoch 53/200  
14/14 [==============================] - 0s 4ms/step - loss: 0.0426 - acc: 1.0000  
Epoch 54/200  
14/14 [==============================] - 0s 370us/step - loss: 0.0430 - acc: 1.0000  
Epoch 55/200  
14/14 [==============================] - 0s 368us/step - loss: 0.0292 - acc: 1.0000  
Epoch 56/200  
14/14 [==============================] - 0s 494us/step - loss: 0.0777 - acc: 1.0000  
Epoch 57/200  
14/14 [==============================] - 0s 356us/step - loss: 0.0496 - acc: 1.0000  
Epoch 58/200  
14/14 [==============================] - 0s 427us/step - loss: 0.1485 - acc: 1.0000  
Epoch 59/200  
14/14 [==============================] - 0s 381us/step - loss: 0.1006 - acc: 1.0000  
Epoch 60/200  
14/14 [==============================] - 0s 421us/step - loss: 0.0183 - acc: 1.0000  
Epoch 61/200  
14/14 [==============================] - 0s 344us/step - loss: 0.0788 - acc: 0.9286  
Epoch 62/200  
14/14 [==============================] - 0s 529us/step - loss: 0.0176 - acc: 1.0000

OK, after 200 epochs the model is trained. Next, define the helper functions that turn a sentence into a bag of words:

def clean_up_sentence(sentence):  
    # tokenize the pattern - split words into array  
    sentence_words = nltk.word_tokenize(sentence)  
    # stem each word - create short form for word  
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]  
    return sentence_words

def bow(sentence, words, show_details=True):  
    # tokenize the pattern  
    sentence_words = clean_up_sentence(sentence)  
    # bag of words - matrix of N words, vocabulary matrix  
    bag = [0]*len(words)    
    for s in sentence_words:  
        for i,w in enumerate(words):  
            if w == s:   
                # assign 1 if current word is in the vocabulary position  
                bag[i] = 1  
                if show_details:  
                    print ("found in bag: %s" % w)  
    return(np.array(bag))

Test it to see whether the input hits the bag:

p = bow("你好", words)  
print (p)

Output:

found in bag: 你好  
[0 0 1 0 0 0 0 0 0 0 0 0 0 0]

Clearly the match succeeded: the word is in the bag.
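One thing to watch: judging from the vocabulary printed earlier, each Chinese pattern stays a single token, so an input that is not exactly one of the known patterns produces a bag of all zeros, and the model will still return some probabilities for it. The sketch below shows one possible guard; bag_or_none and the shortened vocabulary are made up for this illustration.

```python
small_vocab = ['88', '你好', '再见']  # shortened vocabulary for illustration


def bag_or_none(tokens, vocabulary):
    """Return the bag of words, or None when no token is in the vocabulary."""
    bag = [1 if w in tokens else 0 for w in vocabulary]
    return bag if any(bag) else None


print(bag_or_none(['你好'], small_vocab))    # [0, 1, 0]
print(bag_or_none(['你好啊'], small_vocab))  # None: '你好啊' is a different token
```

When the result is None, the bot could reply with a fixed "I did not understand" message instead of asking the model.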

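Before packaging the model, we can use model.predict to classify user input and return the user's intent according to the computed probabilities (several intents may be returned, sorted by descending probability):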

def classify_local(sentence):  
    ERROR_THRESHOLD = 0.25  
      
    # generate probabilities from the model  
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])  
    results = model.predict([input_data])[0]  
    # filter out predictions below a threshold, and provide intent index  
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]  
    # sort by strength of probability  
    results.sort(key=lambda x: x[1], reverse=True)  
    return_list = []  
    for r in results:  
        return_list.append((classes[r[0]], str(r[1])))  
    # return tuple of intent and probability  
      
    return return_list

Test it:

print(classify_local('您好'))

Output:

found in bag: 您好  
[('打招呼', '0.999913')]  

Another test:

print(classify_local('88'))

Output:

found in bag: 88  
[('告别', '0.9995449')]

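Perfect: the two inputs are matched to the 打招呼 and 告别 context tags respectively. If you like, run a few more tests and refine the model.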
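The threshold-and-sort step of classify_local can be tried out without a trained model. The sketch below is a stand-alone illustration with made-up probabilities; rank_intents is a hypothetical name.

```python
def rank_intents(probabilities, class_names, threshold=0.25):
    """Keep classes whose probability exceeds the threshold, best first."""
    kept = [(name, p) for name, p in zip(class_names, probabilities) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# made-up probabilities, not model output
print(rank_intents([0.30, 0.70], ['告别', '打招呼']))  # both pass, '打招呼' first
print(rank_intents([0.05, 0.95], ['告别', '打招呼']))  # only '打招呼' passes
print(rank_intents([0.2] * 5, list('abcde')))          # [] : nothing passes
```

With only two classes at least one probability is always 0.5 or higher, so the list is never empty. With more classes it can be, so code that reads the first element should check for an empty result first.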

Once testing is done, we can save the trained model so that it does not have to be retrained before every use:

json_file = model.to_json()  
with open('v3ucn.json', "w") as file:  
   file.write(json_file)  
  
model.save_weights('./v3ucn.h5f')

The model is stored as an architecture file (json) and a weights file (h5f). Keep both; we will need them shortly.
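The two model files hold only the network. Anything that later calls bow and classify_local also needs the words and classes lists. One option, sketched here with the pickle module the script already imports, is to store them next to the model. The helper names and the file name used below are assumptions for illustration, not something the original script does.

```python
import pickle


def save_vocabulary(path, words, classes):
    """Store the vocabulary and class list next to the model files."""
    with open(path, "wb") as fh:
        pickle.dump({"words": words, "classes": classes}, fh)


def load_vocabulary(path):
    """Read back the two lists written by save_vocabulary."""
    with open(path, "rb") as fh:
        data = pickle.load(fh)
    return data["words"], data["classes"]
```

In test_bot.py this would be called as save_vocabulary("v3ucn_vocab.pkl", words, classes) after training, where v3ucn_vocab.pkl is a hypothetical file name.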

Next, let's build an API for the chatbot. We use FastAPI, a very popular framework at the moment. After placing the model files in the project directory, write main.py:

import random
import pandas as pd
import uvicorn
from fastapi import FastAPI

# Assumption: test_bot.py sits next to main.py and exposes these names.
# Importing it runs that script, training included, unless its top-level
# code is guarded or the vocabulary is stored separately.
from test_bot import intents, words, classes, bow

app = FastAPI()

# load the saved model once at startup instead of on every request
from keras.models import model_from_json
# load json and create model
file = open("./v3ucn.json", 'r')
model_json = file.read()
file.close()
model = model_from_json(model_json)
model.load_weights("./v3ucn.h5f")


def classify_local(sentence):
    ERROR_THRESHOLD = 0.25

    # generate probabilities from the model
    input_data = pd.DataFrame([bow(sentence, words)], dtype=float, index=['input'])
    results = model.predict([input_data])[0]
    # filter out predictions below a threshold, and provide intent index
    results = [[i,r] for i,r in enumerate(results) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append((classes[r[0]], str(r[1])))
    # return tuple of intent and probability

    return return_list

@app.get('/')
async def root(word: str = None):

    a = ""
    # no input, or no intent above the threshold: return an empty message
    wordlist = classify_local(word) if word else []
    if wordlist:
        for intent in intents['intents']:
            if intent['tag'] == wordlist[0][0]:
                a = random.choice(intent['responses'])

    return {'message':a}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)

This part:

from keras.models import model_from_json  
file = open("./v3ucn.json", 'r')  
model_json = file.read()  
file.close()  
model = model_from_json(model_json)  
model.load_weights("./v3ucn.h5f")

loads the model we just trained. Then start the service:

uvicorn main:app --reload

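The result looks like this: requesting http://127.0.0.1:8000/?word=你好 returns a JSON object whose message field holds one of the 打招呼 responses.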
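To try the service from a script rather than a browser, something like the following could be used. It is a sketch that relies only on the standard library and assumes the service is running on the host and port used in main.py; build_url and ask are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000/"  # host and port from the uvicorn.run call


def build_url(word):
    """Percent-encode the text and attach it as the `word` query parameter."""
    return BASE_URL + "?" + urlencode({"word": word})


def ask(word):
    """Send one message to the running service and return its reply text."""
    with urlopen(build_url(word)) as response:
        return json.loads(response.read().decode("utf-8"))["message"]


print(build_url("你好"))
```

Calling ask("你好") while the service is up should return one of the responses defined for the 打招呼 tag. The encoding step matters because Chinese text cannot be placed in a URL as-is.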

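Conclusion: there is no doubt that technology changes our lives. A chatbot lets us hear a pleasant voice even when we have no company, and I believe that before long a smiling, elegant "Ex Machina" will keep us company under the gentle breeze and the bright moon.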


Posted 2022-07-28 23:43 by 刘悦的技术博客
