推荐系统(五)
(原创文章,转载请注明出处!)
协同过滤算法(Collaborative Filtering)基于的根据是,类似的人会喜好类似的物品。如果将用户进行聚类,当来了新用户,将新用户归到相应的类,用类的评分结果来形成新用户的推荐结果。在文章 Collaborative Filtering Based on Iterative Principal Component Analysis 中给出了一种使用主成分分解技术对评分数据进行分解,然后对用户进行聚类的方法,算法的流程如下:
假设有M个用户对n个物品进行了评价,形成了 M x n的评分矩阵A。
1. 对矩阵A中的缺失值aij(用户没有对相应的物品进行评价)进行填充:使用i行和j列的平均值。
2. 对矩阵A进行均值normalization :将每个评价减去其对应的列(物品)平均值。
3. 对矩阵A进行SVD分解 :A = U∑VT,选择前k个最大的奇异值,通过Uk∑kVkT = A',得到原矩阵A的近似矩阵A'
4. 将矩阵A中最初的缺失值,使用A'中相应位置的值替代
5. 重复2、3、4直到收敛。收敛条件:相邻两次(l-1与l)迭代的缺失值的估计值的平方和之差与第(l-1)次的缺失值的估计值的平方和的比值小于某个常数(比如:10-10)
(对最终的评分还需要进行均值normalization的逆操作,即:加上物品平均值)
通过以上的过程就将已有用户缺失的评分值预测出来。此外还通过上述的步骤的到主成分:Uk∑k,一个M x k的矩阵,可以将这个矩阵作为训练数据使用K均值聚类来对用户进行聚类(也可以采用其他的聚类方法,如:RRC)。
聚类的类别数量选择条件:|Sk - Sk+1| / S2≤ ε , ε是给定的常数,比如:0.001
Sk是k类的情况下,聚类完成后,各个样本点到各自的聚类中心的距离平方和
Sk+1是k+1类的情况下,聚类完成后,各个样本点到各自的聚类中心的距离平方和
S2是2类的情况下,聚类完成后,各个样本点到各自的聚类中心的距离平方和
当来了一个新用户:
1. 选择最后的m-1行,形成一个(m-1) x n的子矩阵; (m < M)
2. 对一个新用户,将其对n个物品的评价向量添加到子矩阵的最后一行,形成m x n矩阵B
3. 对矩阵B,使用上述的迭代算法,将最后一行中的缺失值填上
4. 然后,计算出最后一行的PC
5. 计算最后一行PC到各个聚类中心的距离,决定新用户属于哪个类
6. 利用所选择类的平均值,来确实最后一行的缺失值,并形成最终的推荐结果
算法实现:
总体上来看,整个算法需要做的事情有:
1. 通过迭代的SVD分解,将已存在的用户对物品的评价缺失值,给补充上,并且计算出主成分
2. 通过一个聚类算法(K均值聚类),以原始评分数据的主成分为输入,进行用户聚类
3. 对新用户,使用迭代的SVD分解将新用户的缺失评分补充上,并计算新用户评分矩阵的主成分
4. 通过计算新用户的主成分与各聚类中心的距离,将新用户归类,使用类的评均值,作为新用户缺失评分的最终估计值
5. 寻找新用户缺失评分的Top-n,作为对新用户的推荐结果
R代码如下:
1 ## normalize a vector with mean normanization ( x-u ) 2 ## substract column mean from each element 3 ## Args : 4 ## x - a matrix 5 ## Returns : 6 ## a list contains, mean of each colum, 7 ## normalized x 8 meanNormalization <- function(x) 9 { 10 meanOfcol <- numeric(dim(x)[2]) 11 sdOfcol <- numeric(dim(x)[2]) 12 for (i in 1:dim(x)[2]) { 13 t <- x[,i] 14 idx <- which(t != 0) 15 if (length(idx) < 1) { 16 meanOfcol[i] <- NA 17 next 18 } 19 meanOfcol[i] <- mean(t[idx]) 20 x[idx,i] <- t[idx] - mean(t[idx]) # mean normalization 21 } 22 23 return ( list(meanOfcol = meanOfcol, xNormalized=x) ) 24 } 25 ## Args : 26 ## x - a vector 27 ## u - a real number, mean value 28 ## Returns : 29 ## a vector 30 meanNormalizationInverse <- function(x, u) 31 { 32 return (x + u) 33 } 34 35 ## iterate with SVD to get the Principle Component of users and fill the missing rating 36 ## Args : 37 ## x - a matrix, contain all rating reslut. 38 ## Each row is the rating made by one user, each colum is the rating of one item. 39 ## If a movie hasn't been rated by a user, the corresponding postion in the matrix is NA. 40 ## pc_threshold - threshold to choose the principal components 41 ## iterate_threshold - svd iteration threshold 42 ## Returns : 43 ## a list, contains : a matrix, principle component of users; 44 ## a matrix, contains all rating between all users and all items, no missing value . 45 iterateSVD <- function( x, pc_threshold = 0.9, iterate_threshold = 10e-10 ) 46 { 47 A <- x 48 ## fill the missing value with the average of ith row and jth colum 49 A[which(is.na(A))] <- 0 50 meanOfColumn <- colMeans(A) 51 meanOfRow <- rowMeans(A) 52 idx <- which(is.na(x)) 53 for (i in idx) { 54 i <- i %% dim(x)[1] # remainder 55 j <- j %/% dim(x)[1] + 1 56 if (i == 0) { 57 i <- 12 58 } 59 A[i,j] <- mean( c(meanOfRow[i], meanOfColumn[j]) ) 60 } 61 62 ## normalize the data by column 63 normlizedResult <- meanNormalization( A ) 64 A <- normlizedResult$xNormalized 65 66 ## iterate util convergence 67 lastSquareSumLast <- 0.001 * iterate_threshold # initialized by a small enough value 68 which( TRUE ) { 69 # svd decomposition 70 svd_A <- svd(A) 71 72 # find the top-k singular value 73 numTopSV <- 0 74 for(sv in svd_A$d) { 75 numTopSV <- numTopSV + 1 76 if ( (sum(svd_A$d[1:numTopSV]) / sum(svd_A$d)) >= pc_threshold ) { 77 break 78 } 79 } 80 81 # approximation of A 82 A_appro <- svd_A$u[,1:numTopSV] %*% diag(svd_A$d[1:numTopSV]) %*% t(svd_A$v[,1:numTopSV]) 83 84 # filling the missing value with the corresponding value in A_appro 85 A[which(is.na(x))] <- A_appro[which(is.na(x))] 86 87 # checking convergence, or not 88 if ( ( (sum(A_appro[which(is.na(x))]^2) - lastSquareSumLast) / lastSquareSumLast ) 89 < iterate_threshold ) { 90 break 91 } 92 93 lastSquareSumLast <- sum(A_appro[which(is.na(x))]^2) 94 } 95 96 ## calculate the principle component of user 97 pcOfUsers <- A %*% svd_A$v[,1:numTopSV] 98 99 normalizedA <- A 100 101 ## reverse the normalization 102 for (i in 1:dim(A)[2]) { 103 A[,i] <- meanNormalizationInverse(A[,i], normlizedResult$meanOfcol) 104 } 105 106 return ( list( pcOfUsers = pcOfUsers, fullRatingResult = A) ) 107 } 108 109 ## cluster the users with k-means cluster method 110 ## Args : 111 ## pc - a matrix, principle component of users, each row is one user 112 ## clusterThreshold - a constant, which is used to choose the 113 ## best user cluster number 114 ## Returns : 115 ## a list, contains : user k-means cluster object 116 ## user cluster number 117 clusterUser <- function(pc, clusterThreshold = 0.001) 118 { 119 clusterNum <- 2 120 uc2 <- kmeans(x, centers = clusterNum) 121 lastKM <- uc2 122 123 ## to make the best choice of cluster number 124 ## iterate with increasing cluster number until convergence 125 while( TRUE ) { 126 clusterNum <- clusterNum + 1 127 uci <- kmeans(x, centers = clusterNum) 128 129 # check convergence 130 if ( ( abs(lastKM$tot.withinss - uci$tot.withinss) / uc2$tot.withinss ) 131 >= clusterThreshold ) { 132 break 133 } 134 lastKM <- uci 135 } 136 137 return ( list(userCluster = uci, clusterNum = clusterNum) ) 138 } 139 140 ## get the recomendation list to new user 141 ## Args : 142 ## ur - a vector, rating of items from a new user 143 ## x - a matrix, contain all rating reslut. 144 ## Each row is the rating made by one user, each colum is the rating of one item. 145 ## If a movie hasn't been rated by a user, the corresponding postion in the matrix is NA. 146 ## m - a integer, number of old users that will be used to fill the 147 ## missing value of vector ur with iterate svd method 148 ## topn - a integer, how many items will be recommended to the new user. 149 ## Returns : 150 ## a list, contains recommendation result 151 recommendToNewUser <- function(ur, x, m, topn) 152 { 153 iterSVD_x <- iterateSVD(x) 154 cluster_x <- clusterUser( iterSVD_x$pcOfUsers ) 155 156 A <- iterSVD_x$fullRatingResult 157 B <- rbind(A[(dim(A)[1] - m + 1) : dim(A)[1],], ur) 158 159 iterSVD_B <- iterateSVD(B) 160 pc_ur <- iterSVD_B$pcOfUsers[m+1, ] # the principle component of new user 161 162 ## choose the cluster of new user 163 clusterCenter <- cluster_x$userCluster$centers 164 distToCC <- numeric(cluster_x$clusterNum) 165 for (i in 1:cluster_x$clusterNum) { 166 distToCC[1] <- sqrt( sum( (pc_ur - clusterCenter[i, ]) ^2 ) ) 167 } 168 # pick the minimum one 169 clusterOfNU <- which( distToCC == min(distToCC) ) 170 171 ## calculate the average value to fill the missing value in ur 172 idx <- which( cluster_x$cluster == clusterOfNU) 173 ratingOfNUCluster <- A[idx,] 174 idx_missing <- which(is.na(ur)) 175 for (j in idx_missing) { 176 ur[j] <- mean( ratingOfNUCluster[,j] ) 177 } 178 179 ## pick the top-n recommendation items 180 pickTopn <- numeric( length(ur) ) 181 pickTopn <- ur[idx_missing] 182 topnIdx <- apply( matrix(pickTopn,nrow=1), MARGIN=1, 183 FUN=function(x) head( order(x, decreasing=TRUE, na.last=TRUE), topn ) ) 184 topnIdx <- as.vector(topnIdx) 185 recommendList <- list(ratingResult = pickTopn[topnIdx], topnIndex = topnIdx) 186 return( recommendList ) 187 }