Recommender Systems (4)

(Original article; please credit the source when reposting!)

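User-item rating data usually forms a very large matrix, and there are typically more users than items. SVD (singular value decomposition) can factor this matrix so that less data is used in the computation, which lowers its cost. Suppose the data $R$ is an $m \times n$ matrix with $m$ users and $n$ items. Its singular value decomposition is $R = U \Sigma V^T$. Projecting $R$ into a lower-dimensional space of dimension $k$ ($k < \min(m, n)$) gives $R_k = R^T U_k \Sigma_k$, where $R^T$ is the $n \times m$ transpose of $R$, $U_k$ is an $m \times k$ matrix (the first $k$ columns of $U$), and $\Sigma_k$ is a $k \times k$ diagonal matrix. The projected matrix $R_k$ is therefore $n \times k$, and each of its rows represents one item.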
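To make the dimensions concrete, here is a minimal sketch of the projection on a toy matrix. The rating values and the choice k = 2 are made up for illustration; only base R's `svd` and `diag` are used.

```r
# Toy rating matrix: m = 4 users (rows), n = 3 items (columns). Values are made up.
R <- matrix(c(5, 3, 1,
              4, 3, 1,
              1, 1, 5,
              2, 1, 4), nrow = 4, byrow = TRUE)

s <- svd(R)   # R = s$u %*% diag(s$d) %*% t(s$v)
k <- 2        # target dimension, k < min(m, n)

U_k     <- s$u[, 1:k, drop = FALSE]    # m x k
Sigma_k <- diag(s$d[1:k], nrow = k)    # k x k diagonal matrix

# Projection from the text: R_k = t(R) %*% U_k %*% Sigma_k, an n x k matrix
R_k <- t(R) %*% U_k %*% Sigma_k

dim(R_k)      # one row per item, k columns

# Sanity check: the full decomposition reproduces R up to rounding error
all.equal(R, s$u %*% diag(s$d) %*% t(s$v))
```

Passing `nrow = k` to `diag` matters when k is 1: `diag` called with a single number builds an identity matrix of that size instead of a 1 x 1 matrix.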

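Procedure:

1. Normalize the raw data (z-score).

2. Apply SVD to the normalized data matrix and project it into the new lower-dimensional space.

3. Use IBCF (item-based collaborative filtering) to compute the recommendations.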
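For step 1, a z-score transform and its inverse could look like the sketch below. The helpers `zScoreSketch` and `zScoreInverseSketch` are hypothetical stand-ins written for this illustration; they are not the `zScoreNormalization` functions from part (3) and may differ from them in orientation and return fields. The sketch assumes column-wise normalization.

```r
# Hypothetical helpers for illustration only (not the functions from part (3)).
# Each column is centred on its mean and scaled by its standard deviation.
zScoreSketch <- function(x) {
    mu  <- colMeans(x)
    sdv <- apply(x, 2, sd)
    z   <- sweep(sweep(x, 2, mu, "-"), 2, sdv, "/")
    list(z = z, mu = mu, sdv = sdv)
}

# Inverse transform: map a z-score back to the original scale
zScoreInverseSketch <- function(z, mu, sdv) z * sdv + mu

# Made-up data: two columns with different scales
x <- matrix(c(1, 2, 3, 4,
              2, 4, 6, 8), ncol = 2)
res <- zScoreSketch(x)

colMeans(res$z)        # approximately 0 for every column
apply(res$z, 2, sd)    # 1 for every column

# Round trip for the entry in row 3, column 2 (original value 6)
zScoreInverseSketch(res$z[3, 2], res$mu[2], res$sdv[2])
```

A column with zero standard deviation would produce NaN here, so real data needs a guard for that case.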

The implementation is as follows:

 1 ## Decompose the rating matrix with SVD and project the rating matrix 
 2 ## to a lower-dimensional space. Find the top n items as the item recommendation 
 3 ## list with the Item-Based Collaborative Filtering algorithm over the 
 4 ## lower-dimensional data.
 5 ## Args :
 6 ##      x  -  a matrix containing all rating results. 
 7 ##            Each column holds the ratings by one user, each row holds the ratings of one movie.
 8 ##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
 9 ##      userI - index of the specified user
10 ##      k  -  number of nearest-neighbour items used for each prediction
11 ##      n  -  top n items that will be recommended to user-I
12 ##      pc_threshold  -  principal component threshold
13 ## Returns :
14 ##      a list containing the recommendation result
15 svdRecommendationIBCF <- function(x, userI, k, n, pc_threshold=0.9)
16 {
17     # note: Pearson correlation coefficient between two vectors x and y of length N
18     # sum((x - u_x)*(y - u_y)) / ((N - 1)*sd_x*sd_y)
19     
20     x[which(is.na(x))] <- 0
21     ## normalize the data
22     normlizedResult <- zScoreNormalization( x )
23     x <- t( normlizedResult$xNormalized )
24     
25     ## svd decomposition
26     svd_x <- svd(x)
27     # find how many leading singular values are needed to reach pc_threshold
28     numTopSV <- 0
29     for(sv in svd_x$d) {
30         numTopSV <- numTopSV + 1
31         if ( (sum(svd_x$d[1:numTopSV]) / sum(svd_x$d)) >= pc_threshold ) {
32             break
33         }
34     }
35     # project the rating data to lower dimension
36     # x_lowDim is an n-by-numTopSV matrix
37     # n is the number of items
38     # numTopSV is less than or equal to min(m , n)
39     x_lowDim <- t(x) %*% svd_x$u[, 1:numTopSV, drop=FALSE] %*% diag(svd_x$d[1:numTopSV], nrow=numTopSV)
40     
41     
42     ## predicting the rating of user-I's un-rated items
43     unRatedIdx <- which(x[,userI] == 0)
44     ratedIdx <- which(x[,userI] != 0)
45     ratingOfUnRatedItems <- numeric( dim(x)[1] )
46     for (i in unRatedIdx) {        
47         # calculate the Pearson correlation coefficient to each item
48         itemSim <- cor( x = x_lowDim[i,], y = t(x_lowDim[ratedIdx,]), use = "everything", method = "pearson" )
49         itemSim <- 0.5 + 0.5*itemSim # keep the similarity in [0,1]
50         # find the k nearest items to item-i
51         KSimilarItemIdx <- apply( matrix(itemSim,nrow=1), 
52                                   MARGIN=1,  # apply the function to each row
53                                   FUN=function(x) head(  order(x, decreasing=TRUE, na.last=TRUE), k)
54                                 )
55         KSimilarItemIdx <- as.vector(KSimilarItemIdx)                              
56 
57         r <- x[ratedIdx,]
58         ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx,userI] * itemSim[KSimilarItemIdx] ) /
59                                    sum( itemSim[KSimilarItemIdx] )
60         if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
61             next
62         }
63         ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i], 
64                                                             normlizedResult$meanOfcol[i], 
65                                                             normlizedResult$sdOfcol[i] )
66     }
67     
68     ## find the Top-N items
69     topnIdx <- apply( matrix(ratingOfUnRatedItems,nrow=1), MARGIN=1, 
70                      FUN=function(x) head(  order(x, decreasing=TRUE, na.last=TRUE), n )  )
71     topnIdx <- as.vector(topnIdx)
72     recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
73     return( recommendList )
74 }
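The loop that picks `numTopSV` can also be written without an explicit loop. The following is a minimal sketch on made-up singular values; it uses the same criterion as the listing (the share of the plain sum of singular values), not the share of squared singular values that is sometimes used instead.

```r
# Made-up singular values, in decreasing order as svd() returns them
d <- c(10, 5, 4, 1)
pc_threshold <- 0.9

# Share of the total singular-value sum covered by the first j values
coverage <- cumsum(d) / sum(d)

# Smallest j whose coverage reaches the threshold
numTopSV <- which(coverage >= pc_threshold)[1]
numTopSV

# Same result with a loop in the style of the listing above
numTopSVLoop <- 0
for (sv in d) {
    numTopSVLoop <- numTopSVLoop + 1
    if (sum(d[1:numTopSVLoop]) / sum(d) >= pc_threshold) break
}
numTopSVLoop
```

Since the last entry of `coverage` is always 1, `which` finds at least one index for any threshold up to 1.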

 

The zScoreNormalization and zScoreNormalizationInverse functions used in the code above are given in Recommender Systems (3).

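The main difference from the IBCF code in Recommender Systems (3) is that lines 25-39 decompose the rating matrix with SVD and project the original rating matrix into a lower-dimensional space, and line 48 uses the low-dimensional matrix when computing the similarity between items, which can reduce the computational cost to some extent.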
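The similarity and prediction step on the low-dimensional matrix can be tried in isolation. The sketch below uses a made-up item matrix and made-up ratings; the variable names `target`, `ratings`, `sim`, `nearest` and `predicted` are illustrative and do not appear in the listing.

```r
# Made-up low-dimensional item matrix: 4 items (rows) in a 3-dimensional space
x_lowDim <- matrix(c( 1.0,  2.0,  3.0,
                      2.0,  4.1,  5.9,
                      3.0,  1.0,  0.0,
                     -1.0, -2.0, -3.5), nrow = 4, byrow = TRUE)

target   <- 1            # item whose rating is to be predicted
ratedIdx <- c(2, 3, 4)   # items the user has already rated
ratings  <- c(4, 2, 1)   # the user's (made-up) ratings for those items
k        <- 2            # number of neighbours to keep

# Pearson correlation between the target item and every rated item
sim <- as.vector(cor(x_lowDim[target, ],
                     t(x_lowDim[ratedIdx, , drop = FALSE])))
sim <- 0.5 + 0.5 * sim   # map [-1, 1] to [0, 1]

# Positions (within ratedIdx) of the k most similar items
nearest <- head(order(sim, decreasing = TRUE), k)

# Similarity-weighted average of the user's ratings for those neighbours
predicted <- sum(ratings[nearest] * sim[nearest]) / sum(sim[nearest])
predicted
```

Each correlation here is computed over k coordinates per item instead of one coordinate per user, which is where the saving comes from when k is much smaller than the number of users.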

 

posted @ 2014-10-09 09:50  activeshj