Mahout SlopeOne Implementation (Part 1)
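2013-02-06 18:37 Polarisary
The basic idea of SlopeOne is to smooth over individual differences in how users rate by averaging, and to estimate the rating of an unrated item with a weighted average. I spent most of today reading the SlopeOne implementation in Mahout. At initialization it computes the average rating difference between every pair of items and keeps the results in memory (in a map, specifically a FastByIDMap). When it later needs to estimate a user's rating for an item that user has not rated, it can read the averages straight from memory. For example: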
Rating | Item 1 | Item 2 | Item 3 |
User 1 | 5 | 2 | 4 |
User 2 | 4 | ? | 5 |
User 3 | 2 | 1 | ? |
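To estimate user 3's rating of item 3, take each item that user 3 has rated and compute the average difference between item 3 and that item over the users who rated both. Item 3 minus item 1 gives ((4-5)+(5-4))/2 = 0, and item 3 minus item 2 gives (4-2)/1 = 2. Adding each difference to user 3's own rating of the corresponding item yields two estimates: 2+0 = 2 and 1+2 = 3. Their plain average is (2+3)/2 = 2.5. Weighting each estimate by the number of users behind it, which is Mahout's default, gives (2*2 + 1*3)/3 ≈ 2.33. What initialization does is precompute these pairwise average differences: items 1 and 3 (((4-5)+(5-4))/2), items 2 and 3 ((4-2)/1), and so on.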
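The arithmetic for the table above can be checked with a small sketch. This is a naive, plain-Java illustration rather than Mahout code: the class name `SlopeOneWorkedExample`, the `predict` method, and the use of NaN for missing ratings are all assumptions made for this example, and nothing is precomputed.

```java
/** Naive Slope One on a small dense matrix; NaN marks a missing rating. */
final class SlopeOneWorkedExample {

    static final double NA = Double.NaN;

    // Rows are users 1..3, columns are items 1..3 (the table from the post).
    static final double[][] RATINGS = {
        {5, 2, 4},
        {4, NA, 5},
        {2, 1, NA},
    };

    /**
     * Estimates ratings[user][target] from every other item the user rated.
     * With weighted == true, each estimate counts once per co-rating user.
     */
    static double predict(double[][] ratings, int user, int target, boolean weighted) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int other = 0; other < ratings[user].length; other++) {
            if (other == target || Double.isNaN(ratings[user][other])) {
                continue;
            }
            // Average of (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (double[] row : ratings) {
                if (!Double.isNaN(row[target]) && !Double.isNaN(row[other])) {
                    diffSum += row[target] - row[other];
                    count++;
                }
            }
            if (count == 0) {
                continue; // no user rated both items, so this item cannot vote
            }
            double weight = weighted ? count : 1.0;
            numerator += weight * (ratings[user][other] + diffSum / count);
            denominator += weight;
        }
        return denominator == 0.0 ? NA : numerator / denominator;
    }

    public static void main(String[] args) {
        // Indices are zero-based: user 3 is row 2, item 3 is column 2.
        System.out.println(predict(RATINGS, 2, 2, false)); // unweighted: 2.5
        System.out.println(predict(RATINGS, 2, 2, true));  // weighted: 7/3, about 2.33
    }
}
```

Recomputing the differences on every call, as this sketch does, is exactly the cost that Mahout avoids by building the averages once at initialization.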
The implementation is as follows:
    /**
     * <p>
     * Creates a default (weighted) {@link SlopeOneRecommender} based on the given {@link DataModel}.
     * </p>
     */
    public SlopeOneRecommender(DataModel dataModel) throws TasteException {
      this(dataModel,
           Weighting.WEIGHTED,
           Weighting.WEIGHTED,
           new MemoryDiffStorage(dataModel, Weighting.WEIGHTED, Long.MAX_VALUE));
    }
MemoryDiffStorage is the class that actually computes and stores the average differences:
    private void buildAverageDiffs() throws TasteException {
      log.info("Building average diffs...");
      try {
        buildAverageDiffsLock.writeLock().lock();
        averageDiffs.clear();
        long averageCount = 0L;
        // Iterate over all user IDs
        LongPrimitiveIterator it = dataModel.getUserIDs();
        while (it.hasNext()) {
          // For each user, accumulate the rating difference between every pair of items that user rated
          averageCount = processOneUser(averageCount, it.nextLong());
        }
        // Pruning: drop inconsequential diffs (entries backed by too few data points)
        pruneInconsequentialDiffs();
        // Refresh the set of all recommendable items
        updateAllRecommendableItems();

      } finally {
        buildAverageDiffsLock.writeLock().unlock();
      }
    }
    private long processOneUser(long averageCount, long userID) throws TasteException {
      log.debug("Processing prefs for user {}", userID);
      // Save off prefs for the life of this loop iteration
      // Fetch the items this user has rated, together with the ratings
      PreferenceArray userPreferences = dataModel.getPreferencesFromUser(userID);
      int length = userPreferences.length();
      for (int i = 0; i < length - 1; i++) {
        float prefAValue = userPreferences.getValue(i);
        long itemIDA = userPreferences.getItemID(i);
        FastByIDMap<RunningAverage> aMap = averageDiffs.get(itemIDA);
        if (aMap == null) {
          aMap = new FastByIDMap<RunningAverage>();
          averageDiffs.put(itemIDA, aMap);
        }
        for (int j = i + 1; j < length; j++) {
          // This is a performance-critical block
          long itemIDB = userPreferences.getItemID(j);
          RunningAverage average = aMap.get(itemIDB);
          if (average == null && averageCount < maxEntries) {
            average = buildRunningAverage();
            aMap.put(itemIDB, average);
            averageCount++;
          }
          if (average != null) {
            average.addDatum(userPreferences.getValue(j) - prefAValue);
          }
        }
        RunningAverage itemAverage = averageItemPref.get(itemIDA);
        if (itemAverage == null) {
          itemAverage = buildRunningAverage();
          averageItemPref.put(itemIDA, itemAverage);
        }
        itemAverage.addDatum(prefAValue);
      }
      return averageCount;
    }
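processOneUser is where the real work happens, and it took me quite a while to understand it. averageDiffs is a FastByIDMap<FastByIDMap<RunningAverage>>, in effect a nested map of the form itemID A -> (itemID B -> average of B's rating minus A's rating). Because the computation iterates user by user, a single call to processOneUser updates (or creates) the average for any given pair of items exactly once. Once the data structure is clear, the rest is easy to follow.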
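To make the nested map concrete, here is a minimal sketch that mirrors the double loop of processOneUser using only java.util collections. It is an illustration under stated assumptions, not Mahout's code: `DiffStoreSketch`, `Avg`, and `getDiff` are hypothetical names, `HashMap` stands in for FastByIDMap, and the maxEntries cap, the per-item averages (averageItemPref), locking, and pruning are all left out. It also assumes each user's ratings arrive sorted by item ID, so a pair is always stored under its smaller ID.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for the nested averageDiffs map; not Mahout code. */
final class DiffStoreSketch {

    /** Hypothetical minimal running average: a count plus the current mean. */
    static final class Avg {
        int count;
        double mean;

        void addDatum(double x) {
            count++;
            mean += (x - mean) / count; // incremental mean update
        }
    }

    // itemA -> (itemB -> running average of rating(B) - rating(A))
    final Map<Long, Map<Long, Avg>> averageDiffs = new HashMap<>();

    /** Processes one user's ratings; itemIDs are assumed sorted ascending. */
    void processOneUser(long[] itemIDs, double[] values) {
        for (int i = 0; i < itemIDs.length - 1; i++) {
            Map<Long, Avg> aMap =
                averageDiffs.computeIfAbsent(itemIDs[i], k -> new HashMap<>());
            for (int j = i + 1; j < itemIDs.length; j++) {
                // Each (A, B) pair is visited exactly once per user.
                aMap.computeIfAbsent(itemIDs[j], k -> new Avg())
                    .addDatum(values[j] - values[i]);
            }
        }
    }

    /** Returns the average of rating(b) - rating(a), or null if nothing is stored. */
    Avg getDiff(long a, long b) {
        Map<Long, Avg> aMap = averageDiffs.get(a);
        return aMap == null ? null : aMap.get(b);
    }

    public static void main(String[] args) {
        DiffStoreSketch store = new DiffStoreSketch();
        store.processOneUser(new long[] {1, 2, 3}, new double[] {5, 2, 4}); // user 1
        store.processOneUser(new long[] {1, 3}, new double[] {4, 5});       // user 2
        store.processOneUser(new long[] {1, 2}, new double[] {2, 1});       // user 3
        Avg d13 = store.getDiff(1, 3);
        System.out.println("item 3 - item 1: mean " + d13.mean + ", users " + d13.count);
    }
}
```

For the ratings table in this post, the entry for items 1 and 3 ends up with a mean of 0 from two users, and the entry for items 2 and 3 with a mean of 2 from one user, matching the hand calculation. Because pairs are only stored in one direction, a lookup such as getDiff(3, 1) finds nothing in this sketch; a caller would need to swap the IDs and negate the mean.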