Mahout study notes 3

Understanding user-based recommendation

1. When recommendation goes wrong

Choose an appropriate data set as the training data.

2. When recommendation goes right

 

 Exploring the user-based recommender

1. The algorithm

Find similar users, then look at what they are interested in.

2. Implementing the algorithm with GenericUserBasedRecommender

Mahout isn’t a single recommender engine, but an assortment of components that can be plugged together and customized to create an ideal recommender for a particular domain. Here the following components are assembled:

- Data model, implemented via DataModel
- User-user similarity metric, implemented via UserSimilarity
- User neighborhood definition, implemented via UserNeighborhood
- Recommender engine, implemented via a Recommender (here, GenericUserBasedRecommender)

Getting good results, and getting them fast, is inevitably a long process of experimentation and refinement.
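A minimal sketch of wiring these components together (the file name data.csv, the neighborhood size of 10, and the user/item IDs are placeholder choices; the classes and constructors are the standard Mahout Taste API):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserBasedExample {
    public static void main(String[] args) throws Exception {
        // Data model: preferences read from a CSV file of userID,itemID,value lines
        DataModel model = new FileDataModel(new File("data.csv"));
        // User-user similarity metric
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood: the 10 users most similar to the target user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        // Recommender engine
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}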

3. Fixed-size neighborhoods

Is a bigger neighborhood always better?

Setting the size too large pulls in users who are not actually very similar, which degrades the recommendations.

4. Threshold-based neighborhood

Both the fixed-size (3) and the threshold-based (4) neighborhood need to be tuned by testing; see the sketch below.
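A small fragment contrasting the two neighborhood definitions (assuming an existing DataModel model and UserSimilarity similarity; the size 100 and threshold 0.7 are arbitrary example values):

import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;

// Fixed-size neighborhood: always the 100 nearest users, even if some are not very similar
UserNeighborhood fixedSize = new NearestNUserNeighborhood(100, similarity, model);

// Threshold-based neighborhood: every user whose similarity is at least 0.7,
// however many or few users that turns out to be
UserNeighborhood thresholdBased = new ThresholdUserNeighborhood(0.7, similarity, model);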

 

Exploring similarity metrics 

 

1. Pearson correlation–based similarity

PearsonCorrelationSimilarity
The Pearson correlation is a number between –1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together. 

The correlation coefficient lies between -1 and 1: a value of 0 means no correlation, values near -1 indicate negative correlation, and values near 1 indicate positive correlation.

http://zh.wikipedia.org/wiki/%E7%9A%AE%E5%B0%94%E9%80%8A%E7%A7%AF%E7%9F%A9%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0

The Pearson correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}

Covariance indicates whether two variables tend to vary in the same direction.

Standard deviation reflects how spread out the individual values are.

Drawbacks:

(1) It does not take into account the number of items two users have rated in common when computing similarity, which is counter-intuitive for a recommender.
(2) If two users have only one rated item in common, the similarity cannot be computed at all.

Employing weighting

Mahout provides an extension of the Pearson correlation similarity that accepts a weighting parameter, which mitigates the problem caused by drawback (1) above; see the sketch below.
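For example (assuming an existing DataModel model; Weighting is Mahout's org.apache.mahout.cf.taste.common.Weighting enum):

// Weighted Pearson: correlations computed over many co-rated items are pushed
// further toward 1.0 or -1.0 than those based on only a few items
UserSimilarity similarity = new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);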

 

2. Defining similarity by Euclidean distance

EuclideanDistanceSimilarity

This implementation is based on the distance between users. This idea makes sense if you think of users as points in a space of many dimensions (as many dimensions as there are items), whose coordinates are preference values. This similarity metric computes the Euclidean distance d between two such user points. This value alone doesn't constitute a valid similarity metric, because larger values would mean more-distant, and therefore less similar, users. The value should be smaller when users are more similar. Therefore, the implementation actually returns 1 / (1+d). Refer to table 4.2, which illustrates this computation. You can verify that when the distance is 0 (users have identical preferences) the result is 1, decreasing to 0 as d increases. This similarity metric never returns a negative value, and larger values still mean more similarity.

The Euclidean distance similarity also requires at least one item rated in common.
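A tiny illustrative sketch of the 1 / (1+d) transformation described above, using two made-up three-item preference vectors (the real metric is EuclideanDistanceSimilarity; this only shows the arithmetic):

// Two users' preference values for the same three items (hypothetical data)
double[] userA = {5.0, 3.0, 2.5};
double[] userB = {2.0, 2.5, 5.0};

// Euclidean distance d between the two points
double sumOfSquares = 0.0;
for (int i = 0; i < userA.length; i++) {
    double diff = userA[i] - userB[i];
    sumOfSquares += diff * diff;
}
double d = Math.sqrt(sumOfSquares);

// 1 / (1 + d): 1.0 for identical users, approaching 0 as the distance grows, never negative
double similarity = 1.0 / (1.0 + d);
System.out.println(similarity);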

 

3. Adapting the cosine measure similarity

In Mahout the cosine measure is available as UncenteredCosineSimilarity; once the data is mean-centered it coincides with the Pearson correlation, so PearsonCorrelationSimilarity covers that case as well.

The cosine measure similarity is another similarity metric that depends on envisioning user preferences as points in space. Hold in mind the image of user preferences as
points in an n-dimensional space. Now imagine two lines from the origin, or point (0,0,...,0), to each of these two points. When two users are similar, they’ll have similar
ratings, and so will be relatively close in space—at least, they’ll be in roughly the same direction from the origin. The angle formed between these two lines will be relatively
small. In contrast, when the two users are dissimilar, their points will be distant, and likely in different directions from the origin, forming a wide angle.
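In formula form (with X and Y the two users' preference vectors over their co-rated items), the cosine of the angle between them is:

\cos\theta = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}

A small angle gives a cosine near 1 (similar users); a wide angle gives a cosine near 0, or a negative value when the vectors point in opposing directions.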

 

4.Defining similarity by relative rank with the Spearman correlation

SpearmanCorrelationSimilarity

The Spearman correlation is an interesting variant on the Pearson correlation, for our purposes. Rather than compute a correlation based on the original preference values,
it computes a correlation based on the relative rank of preference values. Imagine that, for each user, their least-preferred item’s preference value is overwritten with a 1.
Then the next-least-preferred item’s preference value is changed to 2, and so on. To illustrate this, imagine that you were rating movies and gave your least-preferred
movie one star, the next-least favorite two stars, and so on. Then, a Pearson correlation is computed on the transformed values. This is the Spearman correlation.

 

The Spearman correlation likewise ranges between -1 and 1 (in the book's small example it comes out as exactly -1 or 1).

It cannot be computed when two users have only one rated item in common.

// SpearmanCorrelationSimilarity is slow to compute, so wrap it in a CachingUserSimilarity
UserSimilarity similarity = new CachingUserSimilarity(new SpearmanCorrelationSimilarity(model), model);

 

5. Ignoring preference values in similarity with the Tanimoto coefficient

TanimotoCoefficientSimilarity

The Tanimoto coefficient is the ratio of the size of the intersection to the size of the union of the two users' sets of rated items; the preference values themselves are ignored.
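In formula form, with A and B the sets of items each of the two users has expressed some preference for:

T(A, B) = \frac{|A \cap B|}{|A \cup B|}

The result is 1 when the two users' item sets are identical and 0 when they have no items in common.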

 

6. Computing smarter similarity with a log-likelihood test

LogLikelihoodSimilarity

The more unusual the preferences two users have in common, the more those overlaps contribute to the computed similarity.

 

Item-based recommendation

They do have notably different properties. For instance, the running time of an item-based recommender scales up as the number of items increases, whereas a user-based recommender's running time goes up as the number of users increases. This suggests one reason that you might choose an item-based recommender: if the number of items is relatively low compared to the number of users, the performance advantage could be significant.

An item-based recommender's running time grows as the number of items increases.

A user-based recommender's running time grows as the number of users increases.

PearsonCorrelationSimilarity

EuclideanDistanceSimilarity

TanimotoCoefficientSimilarity

LogLikelihoodSimilarity
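These similarity implementations also implement ItemSimilarity, so an item-based recommender can be assembled much like the user-based one. A minimal fragment, assuming an existing DataModel model:

import java.util.List;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// The same Pearson implementation doubles as an item-item similarity
ItemSimilarity itemSimilarity = new PearsonCorrelationSimilarity(model);
// No neighborhood is needed for an item-based recommender
Recommender recommender = new GenericItemBasedRecommender(model, itemSimilarity);
// Top 3 recommendations for user 1
List<RecommendedItem> recommendations = recommender.recommend(1, 3);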

 

Other recommender implementations

SVDRecommender

KnnItemBasedRecommender

 
