Recommender System — Collaborative filtering
1. Overview
Collaborative filtering methods collect and analyze large amounts of data on users' behavior, activities, and preferences (including ratings). They find other users whose interests resemble those of the target user, then use those users' records to predict what the target user will like.
A key advantage of collaborative filtering is that it does not rely on machine-analyzable content. It infers likes and dislikes from user behavior, so it can accurately recommend complex items such as movies without the algorithm having to "understand" the item itself.
Many algorithms have been used to measure user similarity or item similarity in recommender systems, for example the k-nearest neighbor (k-NN) approach and the Pearson correlation coefficient.
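As a rough illustration of how these two measures can work together, the sketch below computes the Pearson correlation between users over the items they have both rated, picks the k most similar users, and predicts a missing rating as a similarity-weighted average of the neighbors' deviations from their own means. The rating data, the user names, and the helper names (`pearson`, `nearest_neighbors`, `predict`) are all made up for this example, and this is only one of several common ways to combine k-NN with Pearson similarity.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4, "D": 4},
    "bob":   {"A": 3, "B": 1, "C": 2, "D": 3, "E": 3},
    "carol": {"A": 4, "B": 3, "C": 4, "D": 3, "E": 5},
    "dave":  {"A": 3, "B": 3, "C": 1, "D": 5, "E": 4},
    "erin":  {"A": 1, "B": 5, "C": 5, "D": 2, "E": 1},
}


def mean(user_ratings):
    """Average of all ratings given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)


def pearson(u, v):
    """Pearson correlation computed over the items both users rated."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu_u = sum(u[i] for i in common) / len(common)
    mu_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu_u) * (v[i] - mu_v) for i in common)
    den_u = sqrt(sum((u[i] - mu_u) ** 2 for i in common))
    den_v = sqrt(sum((v[i] - mu_v) ** 2 for i in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0


def nearest_neighbors(target, k):
    """The k users most similar to `target`, as (similarity, user) pairs."""
    sims = [(pearson(ratings[target], ratings[other]), other)
            for other in ratings if other != target]
    sims.sort(reverse=True)
    return sims[:k]


def predict(target, item, k=2):
    """Target's mean plus the weighted average deviation of the neighbors."""
    neighbors = [(s, o) for s, o in nearest_neighbors(target, k)
                 if s > 0 and item in ratings[o]]
    base = mean(ratings[target])
    if not neighbors:
        return base  # fall back to the user's own average
    num = sum(s * (ratings[o][item] - mean(ratings[o])) for s, o in neighbors)
    den = sum(s for s, _ in neighbors)
    return base + num / den


print(nearest_neighbors("alice", 2))
print(round(predict("alice", "E"), 2))
```

Mean-centering each neighbor's rating is a design choice that compensates for users who rate everything high or low; a plain weighted average would also be a valid k-NN predictor.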
2. Data collection
Collaborative filtering needs user data, which can be collected in two ways: explicit data collection and implicit data collection. In other words, asking openly versus observing quietly.
Explicit methods include:
Asking a user to rate an item on a sliding scale.
Asking a user to rank a collection of items from favorite to least favorite.
Presenting two items to a user and asking him/her to choose the better one of them.
Asking a user to create a list of items that he/she likes.
Implicit methods include:
Observing the items that a user views in an online store.
Analyzing item/user viewing times.
Keeping a record of the items that a user purchases online.
Obtaining a list of items that a user has listened to or watched on his/her computer.
Analyzing the user's social network and discovering similar likes and dislikes.
Personally, I prefer the implicit approach, since it asks no extra work of the user. Its drawback is that it often raises privacy concerns.
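Implicit signals such as views, listens, and purchases usually have to be turned into some numeric preference score before a collaborative filtering algorithm can use them. The sketch below shows one simple way to do that by summing a weight per event type. The event log, the event names, and the weights in `EVENT_WEIGHTS` are assumptions chosen purely for illustration; nothing in the notes above prescribes them.

```python
from collections import defaultdict

# Hypothetical weights per event type (an assumption, tune for real data)
EVENT_WEIGHTS = {"view": 1.0, "listen": 2.0, "purchase": 5.0}

# Hypothetical implicit event log: (user, item, event type)
events = [
    ("u1", "i1", "view"),
    ("u1", "i1", "purchase"),
    ("u1", "i2", "view"),
    ("u2", "i2", "listen"),
    ("u2", "i2", "listen"),
]


def implicit_scores(events, weights=EVENT_WEIGHTS):
    """Sum the event weights per (user, item); unknown event types count as 0."""
    scores = defaultdict(lambda: defaultdict(float))
    for user, item, kind in events:
        scores[user][item] += weights.get(kind, 0.0)
    return {user: dict(items) for user, items in scores.items()}


print(implicit_scores(events))
```

The resulting user-to-item scores can stand in for explicit ratings, although they express confidence in an interest rather than a stated opinion.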
3. Data analysis
The recommender system compares the data collected for a user with data collected earlier from other users, both similar and dissimilar, and calculates a list of recommended items for that user. Examples:
One of the most famous examples of collaborative filtering is item-to-item collaborative filtering ("people who buy x also buy y"), an algorithm popularized by Amazon.com's recommender system.
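The "people who buy x also buy y" idea can be sketched with co-purchase counts. The example below scores each pair of items by cosine similarity over binary purchase data (buyers of both, divided by the square root of the product of each item's buyer count) and lists the closest items. The purchase data and the function names are hypothetical, and this is a minimal illustration of the general idea, not Amazon's actual implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical purchase histories: user -> set of items bought
purchases = {
    "u1": {"book", "lamp", "pen"},
    "u2": {"book", "pen"},
    "u3": {"book", "lamp"},
    "u4": {"pen", "mug"},
}


def item_similarities(purchases):
    """Cosine similarity between items, based on who bought them."""
    count = defaultdict(int)     # item -> number of buyers
    co_count = defaultdict(int)  # (item_a, item_b) -> buyers of both
    for basket in purchases.values():
        for item in basket:
            count[item] += 1
        for a, b in combinations(sorted(basket), 2):
            co_count[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), both in co_count.items():
        score = both / sqrt(count[a] * count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def also_bought(item, sims, top_n=2):
    """Items most similar to `item`; ties are broken alphabetically."""
    ranked = sorted(sims[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


sims = item_similarities(purchases)
print(also_bought("book", sims))
```

Because the similarities depend only on items, they can be computed ahead of time and looked up when a user views or buys something.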
Other examples include:
Last.fm recommends music by comparing the listening habits of similar users.
Facebook, MySpace, LinkedIn, and other social networks use collaborative filtering to recommend new friends, groups, and other social connections, by examining the network of connections between a user and their friends to find similar groups of users.
4. Problems with collaborative filtering
Collaborative filtering approaches often suffer from three problems: cold start, scalability, and sparsity.
Reference: Sanghack Lee, Jihoon Yang, and Sung-Yong Park, Discovery of Hidden Similarity on Collaborative Filtering to Overcome Sparsity Problem, Discovery Science, 2007.
① Cold start: These systems often require a large amount of existing data on a user in order to make accurate recommendations. Wikipedia's definition of cold start includes this passage:
“it concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.”
For recommender systems, the usual way to mitigate cold start is described as follows:
“In recommender systems, the cold start problem is often reduced by adopting a hybrid approach between content-based matching and collaborative filtering. New items (which have not yet received any ratings from the community) would be assigned a rating automatically, based on the ratings assigned by the community to other similar items. Item similarity would be determined according to the items' content-based characteristics”
In other words, an item that has no user ratings yet is automatically given a provisional rating derived from similar items, and which items count as similar is decided by a content-based method. The result is effectively a hybrid of collaborative filtering and content-based recommendation.
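The hybrid idea for cold start can be sketched as follows: a new, unrated item receives a provisional rating equal to the average community rating of existing items, weighted by how similar their content is to the new item. Here content is a set of tags and similarity is the Jaccard index; both choices, along with the catalogue data and the names `jaccard` and `bootstrap_rating`, are assumptions made for illustration only.

```python
# Hypothetical catalogue: item -> (content tags, community ratings so far)
catalogue = {
    "film_a": ({"sci-fi", "action"}, [5, 4, 5]),
    "film_b": ({"romance", "drama"}, [2, 3]),
    "film_c": ({"sci-fi", "drama"}, [4, 4]),
}


def jaccard(a, b):
    """Content similarity: shared tags divided by all tags of both items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_rating(new_tags, catalogue):
    """Similarity-weighted mean rating of already-rated items.

    Returns None when no rated item shares any content with the new item.
    """
    num = den = 0.0
    for tags, item_ratings in catalogue.values():
        if not item_ratings:
            continue  # other cold items cannot contribute
        sim = jaccard(new_tags, tags)
        num += sim * (sum(item_ratings) / len(item_ratings))
        den += sim
    return num / den if den else None


# A new film with no ratings yet
print(bootstrap_rating({"sci-fi", "action", "thriller"}, catalogue))
```

Once real ratings start to arrive, the provisional value can be phased out in favor of the collaborative signal.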
② Scalability: These systems often make recommendations in environments with millions of users and products, so calculating recommendations often requires a large amount of computing power.
③ Sparsity: The number of items sold on major e-commerce sites is extremely large. Even the most active users will have rated only a small subset of the overall database, so even the most popular items have very few ratings. The user-item rating matrix is therefore sparse, and any computation on it has to cope with that sparsity.
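To make the sparsity point concrete, the short sketch below measures how much of a user-item matrix is actually filled in, and stores only the observed ratings in a dictionary of dictionaries instead of a dense grid. The rating log is hypothetical toy data; on a real e-commerce catalogue the density would typically be far lower than in this tiny example.

```python
# Hypothetical rating log: (user, item, rating)
rating_log = [
    ("u1", "i1", 5),
    ("u1", "i3", 3),
    ("u2", "i2", 4),
    ("u3", "i1", 2),
    ("u3", "i4", 5),
]

users = {user for user, _, _ in rating_log}
items = {item for _, item, _ in rating_log}

# Sparse storage: only observed cells are kept
sparse = {}
for user, item, rating in rating_log:
    sparse.setdefault(user, {})[item] = rating

observed = sum(len(row) for row in sparse.values())
density = observed / (len(users) * len(items))
sparsity = 1.0 - density

print(f"{observed} of {len(users) * len(items)} cells rated, "
      f"sparsity = {sparsity:.2%}")
```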
To address this, a particular type of collaborative filtering algorithm uses matrix factorization, a low-rank matrix approximation technique.
References:
I. Markovsky, Low-Rank Approximation: Algorithms, Implementation, Applications, Springer, 2012, ISBN 978-1-4471-2226-5
Takács, G.; Pilászy, I.; Németh, B.; Tikk, D. (March 2009). "Scalable Collaborative Filtering Approaches for Large Recommender Systems". Journal of Machine Learning Research 10: 623–656
Rennie, J.; Srebro, N. (2005). "Fast Maximum Margin Matrix Factorization for Collaborative Prediction". In Luc De Raedt, Stefan Wrobel (eds.). Proceedings of the 22nd Annual International Conference on Machine Learning. ACM Press.
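As a minimal sketch of the matrix factorization idea mentioned above, the code below approximates a sparse rating matrix as the product of two low-rank factor matrices, P (users) and Q (items), trained by stochastic gradient descent on the observed ratings only. The toy ratings, the hyperparameters (`rank`, `lr`, `reg`, `epochs`), and the function names are assumptions for illustration; this is a basic regularized factorization, not the specific methods of the references listed above.

```python
import random
from math import sqrt

# Hypothetical observed ratings: (user index, item index, rating)
observed = [
    (0, 0, 5.0), (0, 1, 3.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 1, 1.0), (2, 3, 5.0),
    (3, 1, 1.0), (3, 2, 5.0), (3, 3, 4.0),
]


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def rmse(P, Q, data):
    """Root mean squared error on the given (user, item, rating) triples."""
    return sqrt(sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in data) / len(data))


def factorize(data, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=1000, seed=0):
    """Learn P (n_users x rank) and Q (n_items x rank) so that P[u].Q[i] ~ r."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in data:
            err = r - dot(P[u], Q[i])
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


P, Q = factorize(observed, n_users=4, n_items=4)
print("training RMSE:", rmse(P, Q, observed))
# Predicted rating for a cell that was never observed (user 0, item 2)
print("prediction for user 0, item 2:", dot(P[0], Q[2]))
```

Because training touches only the observed cells, the cost scales with the number of ratings rather than with the full user-by-item grid, which is what makes this approach attractive for sparse data.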