
Mahout SlopeOne Implementation (Part 1)

2013-02-06 18:37  Polarisary

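The basic idea of SlopeOne is to use average rating differences between items to smooth out individual users' rating habits, and to estimate a missing rating as a weighted average built from those differences. I spent most of today reading the SlopeOne implementation in Mahout. At initialization it computes the average rating difference for every pair of items and keeps the results in memory (in a map, specifically a FastByIDMap). When it later has to estimate a user's rating for an item the user has not rated, it can read the averages straight from memory. For example: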

Rating    Item 1    Item 2    Item 3
User 1    5         2         4
User 2    4         -         5
User 3    2         1         -

(- means not rated)

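To estimate user 3's rating for item 3, build one estimate from each item that user 3 has rated. Via item 1: the average of (item 3 - item 1) over the users who rated both is ((4-5)+(5-4))/2 = 0, and adding user 3's rating of 2 for item 1 gives 2. Via item 2: the average of (item 3 - item 2) is (4-2)/1 = 2, and adding user 3's rating of 1 for item 2 gives 3. The plain average of the two estimates is (2+3)/2 = 2.5; weighting each estimate by its number of co-rating users (2 and 1) gives (2*2 + 1*3)/3 ≈ 2.33. What initialization precomputes are exactly these pairwise averages: the item 1 / item 3 difference ((4-5)+(5-4))/2, the item 2 / item 3 difference (4-2)/1, and so on.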
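The arithmetic for this example can be checked with a small standalone program. This is only a sketch, not Mahout code: the class name SlopeOneExample, the array layout, and the helper methods are made up for illustration. It assumes the table reads as user 2 having rated items 1 and 3, and user 3 having rated items 1 and 2.

```java
/** Slope One prediction on the three-user example table (illustrative sketch, not Mahout code). */
final class SlopeOneExample {

  // Rows are users 1-3, columns are items 1-3; NaN marks "not rated" (assumed table layout).
  static final double[][] RATINGS = {
      {5, 2, 4},
      {4, Double.NaN, 5},
      {2, 1, Double.NaN},
  };

  /** Returns {sum of (target - other) over users who rated both items, number of such users}. */
  static double[] diffStats(double[][] r, int target, int other) {
    double sum = 0;
    int count = 0;
    for (double[] row : r) {
      if (!Double.isNaN(row[target]) && !Double.isNaN(row[other])) {
        sum += row[target] - row[other];
        count++;
      }
    }
    return new double[] {sum, count};
  }

  /** Estimates r[user][target]; if weighted, each estimate counts in proportion to its co-rating users. */
  static double predict(double[][] r, int user, int target, boolean weighted) {
    double numerator = 0;
    double denominator = 0;
    for (int other = 0; other < r[user].length; other++) {
      if (other == target || Double.isNaN(r[user][other])) {
        continue; // only items this user has actually rated can contribute
      }
      double[] stats = diffStats(r, target, other);
      if (stats[1] == 0) {
        continue; // no user rated both items
      }
      double estimate = stats[0] / stats[1] + r[user][other];
      double weight = weighted ? stats[1] : 1.0;
      numerator += weight * estimate;
      denominator += weight;
    }
    return denominator == 0 ? Double.NaN : numerator / denominator;
  }

  public static void main(String[] args) {
    // User 3 (index 2), item 3 (index 2)
    System.out.println(predict(RATINGS, 2, 2, false)); // plain average of the per-item estimates
    System.out.println(predict(RATINGS, 2, 2, true));  // weighted by co-rating counts
  }
}
```

For user 3 and item 3 the two per-item estimates are 2 and 3, so the unweighted result is 2.5 and the weighted one is 7/3. Unlike Mahout, this sketch recomputes the differences on every call; caching them up front is the whole point of the storage class.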

The implementation looks like this:

    /**
     * <p>
     * Creates a default (weighted) {@link SlopeOneRecommender} based on the given {@link DataModel}.
     * </p>
     */
    public SlopeOneRecommender(DataModel dataModel) throws TasteException {
      this(dataModel,
           Weighting.WEIGHTED,
           Weighting.WEIGHTED,
           new MemoryDiffStorage(dataModel, Weighting.WEIGHTED, Long.MAX_VALUE));
    }

MemoryDiffStorage is the class that actually computes and stores the average differences:

    private void buildAverageDiffs() throws TasteException {
      log.info("Building average diffs...");
      try {
        buildAverageDiffsLock.writeLock().lock();
        averageDiffs.clear();
        long averageCount = 0L;
        // Iterate over every user in the data model
        LongPrimitiveIterator it = dataModel.getUserIDs();
        while (it.hasNext()) {
          // Accumulate the pairwise rating differences among the items this user has rated
          averageCount = processOneUser(averageCount, it.nextLong());
        }
        // Pruning: drop diffs backed by only a single data point
        pruneInconsequentialDiffs();
        // Refresh the set of all recommendable items
        updateAllRecommendableItems();
      } finally {
        buildAverageDiffsLock.writeLock().unlock();
      }
    }

processOneUser, which is called once per user:

    private long processOneUser(long averageCount, long userID) throws TasteException {
      log.debug("Processing prefs for user {}", userID);
      // Save off prefs for the life of this loop iteration
      // Look up the items this user has rated, together with the ratings
      PreferenceArray userPreferences = dataModel.getPreferencesFromUser(userID);
      int length = userPreferences.length();
      for (int i = 0; i < length - 1; i++) {
        float prefAValue = userPreferences.getValue(i);
        long itemIDA = userPreferences.getItemID(i);
        FastByIDMap<RunningAverage> aMap = averageDiffs.get(itemIDA);
        if (aMap == null) {
          aMap = new FastByIDMap<RunningAverage>();
          averageDiffs.put(itemIDA, aMap);
        }
        for (int j = i + 1; j < length; j++) {
          // This is a performance-critical block
          long itemIDB = userPreferences.getItemID(j);
          RunningAverage average = aMap.get(itemIDB);
          if (average == null && averageCount < maxEntries) {
            average = buildRunningAverage();
            aMap.put(itemIDB, average);
            averageCount++;
          }
          if (average != null) {
            average.addDatum(userPreferences.getValue(j) - prefAValue);
          }
        }
        RunningAverage itemAverage = averageItemPref.get(itemIDA);
        if (itemAverage == null) {
          itemAverage = buildRunningAverage();
          averageItemPref.put(itemIDA, itemAverage);
        }
        itemAverage.addDatum(prefAValue);
      }
      return averageCount;
    }

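processOneUser is where the real work happens, and it took me a long time to follow. averageDiffs is a FastByIDMap<FastByIDMap<RunningAverage>>, in effect a two-level map: itemID A -> (itemID B -> average of B's rating minus A's rating). Because the computation iterates user by user, a single call to processOneUser updates (or creates) the average for any given pair of items at most once. Once the data structure is clear, the rest is easy to follow.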
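To make the two-level layout concrete, here is a minimal sketch that mimics it with java.util.HashMap. The names DiffStorageSketch and Avg are hypothetical stand-ins, not Mahout classes. It also leaves out the maxEntries cap, the locking, and the per-item averages that the real code maintains.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the itemA -> (itemB -> running average of B - A) layout; not Mahout code. */
final class DiffStorageSketch {

  /** Minimal stand-in for a running average: keeps a count and the current mean. */
  static final class Avg {
    int count;
    double mean;

    void addDatum(double x) {
      count++;
      mean += (x - mean) / count; // incremental mean update
    }
  }

  final Map<Long, Map<Long, Avg>> averageDiffs = new HashMap<>();

  /** Folds one user's ratings into the map; itemIDs and values are parallel arrays. */
  void processOneUser(long[] itemIDs, double[] values) {
    for (int i = 0; i < itemIDs.length - 1; i++) {
      Map<Long, Avg> aMap = averageDiffs.computeIfAbsent(itemIDs[i], k -> new HashMap<>());
      for (int j = i + 1; j < itemIDs.length; j++) {
        // Each (i, j) pair is visited exactly once per user.
        aMap.computeIfAbsent(itemIDs[j], k -> new Avg()).addDatum(values[j] - values[i]);
      }
    }
  }

  public static void main(String[] args) {
    DiffStorageSketch s = new DiffStorageSketch();
    s.processOneUser(new long[] {1, 2, 3}, new double[] {5, 2, 4}); // user 1
    s.processOneUser(new long[] {1, 3}, new double[] {4, 5});       // user 2
    s.processOneUser(new long[] {1, 2}, new double[] {2, 1});       // user 3
    Avg d13 = s.averageDiffs.get(1L).get(3L);
    System.out.println(d13.count + " users, mean diff " + d13.mean);
  }
}
```

With the three users from the example table, the entry for items 1 and 3 ends up with a count of 2 and a mean of 0. The entry for items 2 and 3 has a count of 1, because only user 1 rated both.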