Building a Recommender System with Mahout and Elasticsearch

Contents

Software
Steps
Controlling relevance
Summary
References

This article shows how to build a recommendation engine with the MapR Sandbox for Hadoop, Apache Mahout, and Elasticsearch, using very little code.

This tutorial will give step-by-step instructions on how to:

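Use the movie rating data available at http://grouplens.org/datasets/movielens/
Build and train a machine learning model with Apache Mahout's collaborative filtering
Use Elasticsearch's search technology to simplify development of the recommender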

Software

This tutorial runs on the MapR Sandbox. It also requires Elasticsearch and Mahout to be installed on the Sandbox.

Download the 10M MovieLens data from http://grouplens.org/datasets/movielens/
Install Mahout
Install Elasticsearch

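Steps

Step 1: Index the movie metadata into Elasticsearch

In Elasticsearch, all fields of a document are indexed by default. The simplest documents have a single level of JSON structure. Documents are stored in an index, and a document's type tells Elasticsearch how to interpret its fields.

You can think of an Elasticsearch index as a database in a relational database system, a type as a table, a field as a column definition (although fields are a broader concept in Elasticsearch), and a document as a row in that table.

In this example the document type is film, with the following fields: movie ID (id), title (title), release year (year), genre tags (genre), indicators (indicators), and the number of entries in the indicators array (numFields):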

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery", "Thriller"],
  "indicators": ["154", "272", "154", "308", "535", "583", "593", "668", "670", "680", "702", "745"],
  "numFields": 12
}

You talk to Elasticsearch through its RESTful API on port 9200, for example from the command line with curl. See the Elasticsearch REST interface and the Elasticsearch 101 tutorial.

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

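The put mapping command of the Elasticsearch REST API defines a document type. The request below creates a mapping named film in the bigmovie index. The mapping defines a numFields field of type integer. By default all fields are stored and indexed, and integers are no exception.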

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film": {
      "properties": {
        "numFields": { "type": "integer" }
      }
    }
  }
}'

Movie information is contained in the movies.dat file. Each line of the file represents one movie, with fields in this format:

MovieID::Title::Genres

For example:

65006::Impulse (2008)::Mystery|Thriller

Figure 1. The film Impulse (2008), genres Mystery and Thriller

The following Python script converts the data in movies.dat to JSON so that it can be imported into Elasticsearch:

import json
import re

# Convert movies.dat (MovieID::Title::Genres) into Elasticsearch bulk lines.
with open('movies.dat', 'r') as movie_file:
    for line in movie_file:
        fixed = line.rstrip().split('::')
        if len(fixed) == 3:
            # Drop double quotes and the trailing " (year)" from the title.
            title = re.sub(r' \(.*\)$', '', fixed[1].replace('"', ''))
            genre = fixed[2].split('|')
            # Action line: create document <MovieID> in bigmovie/film.
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0])
            # Source line: the year is the four characters inside the final parentheses.
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }'
                  % (fixed[0], title, fixed[1][-5:-1], json.dumps(genre)))

Run the Python script and redirect the converted output to index.json:

$ python index.py > index.json

This produces the format Elasticsearch expects:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }

{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }

{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }

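Each pair of lines names the index and type and then supplies the movie information to add. This is the input format for Elasticsearch bulk loading.

The Elasticsearch bulk API performs many create, index, update, and delete operations in a single request. The following command tells Elasticsearch to bulk load the contents of index.json: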

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo

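Once the movie information is loaded, you can query it through the REST API. You can also use Sense, an Elasticsearch plugin for Chrome (also offered as a Kibana 4 plugin). An example follows:

The following retrieves the movie with id 1237: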
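The same lookup can also be sketched in Python. This is only an illustration: it assumes Elasticsearch is listening on localhost:9200 as in the curl examples, and document_url and fetch_document are hypothetical helper names, not part of any Elasticsearch client.

```python
import json
from urllib.request import urlopen


def document_url(doc_id, host="localhost:9200", index="bigmovie", doc_type="film"):
    """Build the REST URL of a single document: /<index>/<type>/<id>."""
    return "http://%s/%s/%s/%s" % (host, index, doc_type, doc_id)


def fetch_document(doc_id):
    """GET the document and decode the JSON response (needs a running cluster)."""
    with urlopen(document_url(doc_id)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL, so the sketch runs without a cluster.
    print(document_url("1237"))
```

Calling fetch_document("1237") against a loaded bigmovie index would return the stored film document wrapped in Elasticsearch's response metadata.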

Step 2: Use Mahout to create movie indicators from user rating data

Ratings are contained in the ratings.dat file. Each line represents one user's rating of one movie, in this format:

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968

71567::2338::2::912578016

The ratings.dat file uses "::" as its delimiter, which has to be converted to tabs before Mahout can use it. You can replace :: with tabs using sed:

sed -i 's/::/\t/g' ratings.dat

This command opens the file, replaces "::" with "\t", and saves it in place. Updates are only supported with MapR NFS and thus this command probably won't work on other NFS-on-Hadoop implementations. MapR Direct Access NFS allows files to be modified (supports random reads and writes) and accessed via mounting the Hadoop cluster over NFS.

The sed command produces content in the following format, which can be used as input to Mahout:

71567    2294    5    912580553

71567    2338    2    912580553

The general format is: user item rating timestamp. The timestamp is not used in this example.
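Where in-place edits with sed are not available, the same conversion could be done by writing a new file. A minimal sketch, assuming the input has the ratings.dat layout described above; the function names are illustrative and the caller chooses both paths.

```python
def to_tab_separated(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into a tab-separated line."""
    return "\t".join(line.rstrip("\n").split("::"))


def convert_file(source_path, target_path):
    """Write a tab-separated copy of source_path to target_path."""
    with open(source_path) as source, open(target_path, "w") as target:
        for line in source:
            target.write(to_tab_separated(line) + "\n")


if __name__ == "__main__":
    print(to_tab_separated("71567::2294::5::912577968"))
```

Writing to a separate file leaves the original download untouched, which makes it easy to rerun the conversion.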

Start the Mahout item similarity (itemsimilarity) job with the following command:

mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

The argument "-s SIMILARITY_LOGLIKELIHOOD" (the short form of --similarityClassname) tells the recommender to use the Log Likelihood Ratio (LLR) method for determining which items co-occur anomalously often and thus which co-occurrences can be used as indicators of preference. The similarity defaults to 0.9; this can be adjusted based on the use case with the --threshold parameter, which will discard pairs with lower similarity (the default is a fine choice). Mahout computes the recommendations by launching many Hadoop MapReduce jobs and finally writes an output file to the /user/user01/mloutput directory. The output file has the following format:

64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588

The general format is: item1id item2id similarity.
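To make the LLR idea more concrete, the sketch below computes a log-likelihood ratio from a 2x2 table of co-occurrence counts, using the entropy form of the statistic. Mapping the ratio to a similarity of 1 - 1/(1 + LLR) is believed to match what SIMILARITY_LOGLIKELIHOOD reports, but treat that as an assumption; the function names and the example counts are illustrative, not taken from the MovieLens data.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users with both items, k12: only A, k21: only B, k22: neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Assumed mapping from LLR to a similarity in [0, 1)."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


if __name__ == "__main__":
    print(llr_similarity(10, 10, 10, 10))  # counts that look independent
    print(llr_similarity(10, 0, 0, 10))    # counts that always co-occur
```

With independent-looking counts the ratio is essentially zero, while counts that always co-occur give a large ratio and a similarity close to 1, which is the kind of pair that survives as an indicator.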

Step 3: Add the movie indicators to the film documents in Elasticsearch

Next, we add the indicators from the output file above to the film documents in Elasticsearch. For example, a movie's indicators go into the indicators field:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery", "Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

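The table on the left shows the indicators each document contains; the table on the right shows which documents contain each indicator:

Figure 2. Documents and indicators

If you search for movies with indicators 1237 and 551, this example returns the document (movie) with id 8298. If you search for 1237 or 551, it returns the movies with ids 8298, 3, and 64418.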
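The lookup just described can be sketched with a tiny in-memory inverted index. The document ids come from the example above, but which of movies 3 and 64418 carries which indicator is an assumption made for illustration, and Elasticsearch's real index structures are far more elaborate.

```python
# Hypothetical indicator lists; only the ids are taken from the example in the text.
documents = {
    "8298": ["1237", "551"],
    "3": ["1237"],
    "64418": ["551"],
}


def build_inverted_index(docs):
    """Map each indicator to the set of document ids that contain it."""
    index = {}
    for doc_id, indicators in docs.items():
        for indicator in indicators:
            index.setdefault(indicator, set()).add(doc_id)
    return index


def search_all(index, terms):
    """AND query: documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index, terms):
    """OR query: documents that contain at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


if __name__ == "__main__":
    index = build_inverted_index(documents)
    print(sorted(search_all(index, ["1237", "551"])))
    print(sorted(search_any(index, ["1237", "551"])))
```

The AND query is an intersection of posting sets and the OR query is a union, which is why the second search returns more movies than the first.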

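The script below reads Mahout's output file part-r-00000, builds an indicators array for each movie, and prints JSON that is used to update the indicators and numFields fields of the film type in the Elasticsearch bigmovie index.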

import csv
import json

### read the output from MAHOUT and collect into hash ###

def emit_update(movie_id, indicators):
    # One bulk action line followed by one partial-document line.
    print(json.dumps({"update": {"_id": movie_id}}))
    print(json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}}))

with open('/user/user01/mloutput/part-r-00000', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    for row in csv_reader:
        movie_id = row[0]
        if movie_id != old_id and old_id != "":
            # A new movie id starts: write out the previous movie's indicators.
            emit_update(old_id, indicators)
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = movie_id
    # Write out the last movie, which the loop itself never emits.
    if old_id != "":
        emit_update(old_id, indicators)

The following command runs the update.py script and writes update.json:

$ python update.py > update.json

The script creates a file with contents like this:

{"update": {"_id": "1"}}

{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}

{"update": {"_id": "2"}}

{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}

On the command line, pass update.json to the Elasticsearch REST bulk API with curl to update the indicators field:

$ curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo

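Step 4: Recommend by searching the indicators field of the film documents

Now you can query the indicators field of the film documents to make recommendations. For example, if someone likes movies 1237 and 551 and you want to recommend similar movies, the following Elasticsearch query returns movies whose indicators array contains 1237 or 551, where 1237 = Seventh Seal and 551 = Nightmare Before Christmas: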

curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  },
  "fields": ["_id", "title", "genre"],
  "size": "8"
}'

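The query above finds movies whose indicators include 1237 or 551, excluding movies 1237 and 551 themselves. The example below runs the query in the Sense plugin, with the results on the right; the recommendations are "A Man Named Pearl" (a documentary) and "Used People".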
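When the liked movies come from application code rather than a hand-written curl command, the request body can be assembled programmatically. A minimal sketch that rebuilds the query above as a Python dict; build_recommendation_query is a hypothetical helper, and sending the JSON to Elasticsearch is left out.

```python
import json


def build_recommendation_query(liked_ids, seed="48", size=8):
    """Match films whose indicators mention any liked id, excluding the liked films."""
    liked = [str(movie_id) for movie_id in liked_ids]
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [{"match": {"indicators": " ".join(liked)}}],
                        "must_not": [{"ids": {"values": liked}}],
                    }
                },
                "functions": [{"random_score": {"seed": seed}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": size,
    }


if __name__ == "__main__":
    print(json.dumps(build_recommendation_query(["1237", "551"]), indent=2))
```

Deriving both the match text and the must_not list from the same input keeps the exclusion in step with the liked movies, so a film the user already likes is never recommended back.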

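Controlling relevance

Full-text search engines rank results by relevance, and Elasticsearch represents a document's relevance score in the _score field. function_score lets you modify that score at query time. random_score generates scores by hashing with a seed value. In the Elasticsearch query shown below, the random_score function adds variation to the search results in order to achieve dithering: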

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  }
}

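Relevance dithering intentionally places some lower-relevance results high in the ranking in order to broaden the training data fed back to the recommendation engine. Without dithering, tomorrow's training data only teaches the model what it already knows today. Adding dithering helps the recommendation model expand. If the model is giving answers that are close to the best ones, dithering can help find the right answer. Effective dithering reduces accuracy today in order to improve tomorrow's training data (and future performance, where algorithmic accuracy counts as part of performance). In other words, to keep future recommendations accurate, the influence of the past on the future has to be reduced.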
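Outside Elasticsearch, the effect of dithering can be illustrated with a few lines of Python. The sketch below is not how random_score works internally. It assumes a simple scheme in which each result's score is the logarithm of its rank plus seeded Gaussian noise, and results are re-sorted by that noisy score. The function name, the noise level, and the placeholder titles are all illustrative.

```python
import math
import random


def dither(ranked_items, noise=0.5, seed=48):
    """Reorder a ranked list by log(rank) plus seeded Gaussian noise."""
    rng = random.Random(seed)
    noisy = [
        (math.log(rank) + rng.gauss(0.0, noise), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    # Lower noisy score means earlier position; the sort is stable for ties.
    noisy.sort(key=lambda pair: pair[0])
    return [item for _, item in noisy]


if __name__ == "__main__":
    results = ["movie-%d" % n for n in range(1, 9)]  # placeholder titles
    print(dither(results))
```

Because the noise is added to the log of the rank, items near the top mostly stay near the top while deeper items occasionally surface; changing the seed changes which ones do, and a noise level of zero leaves the order untouched.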

Summary

We showed in this tutorial how to use Apache Mahout and Elasticsearch with the MapR Sandbox to build a basic recommendation engine. You can go beyond a basic recommender and get even better results with a few simple additions to the design to add cross recommendation of items, which leverages a variety of interactions and items for making recommendations. You can find more information about these technologies here:

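References

To learn more about the components and logic of a recommendation engine, see "An Inside Look at the Components of a Recommendation Engine", which describes the architecture of a recommendation engine, Mahout collaborative filtering, and the Elasticsearch search engine in detail.

More resources on recommendation engines, machine learning, and Elasticsearch: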

Posted 2016-05-24 10:44 by 船长&CAP
