Spark Implementation of ALS
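Initialize the dataset and the Spark environment ----> split the data into training and validation sets ----> train the ALS model ----> validate the results ----> if the results are satisfactory, recommend products directly; otherwise, continue training the ALS model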
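The "validate, then either recommend or keep training" loop in the workflow above can be sketched without Spark. This is a minimal sketch under stated assumptions: `Obs`, `rmse`, and `firstGoodEnough` are hypothetical names that do not exist in Spark, RMSE is just one possible validation metric, and the `train` argument stands in for a real training call such as `ALS.train`. In Spark the split itself is commonly done with `RDD.randomSplit`.

```scala
object WorkflowSketch {
  // Hypothetical stand-in for one (user, product, rating) observation.
  final case class Obs(user: Int, product: Int, rating: Double)

  // Root-mean-square error of a prediction function on a validation set.
  def rmse(validation: Seq[Obs], predict: (Int, Int) => Double): Double = {
    val squaredErrors = validation.map { o =>
      val err = predict(o.user, o.product) - o.rating
      err * err
    }
    math.sqrt(squaredErrors.sum / validation.size)
  }

  // Try candidate settings in order and stop at the first one whose
  // validation RMSE is at or below `threshold`.
  // `train` is a placeholder for real model training (an assumption).
  def firstGoodEnough[P](
      candidates: Seq[P],
      train: P => ((Int, Int) => Double),
      validation: Seq[Obs],
      threshold: Double): Option[(P, Double)] =
    candidates.iterator
      .map(p => (p, rmse(validation, train(p))))
      .find { case (_, error) => error <= threshold }
}
```

If no candidate reaches the threshold, the result is `None`, which corresponds to the "continue training" branch of the workflow.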
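2. Meaning of the dataset
Rating is the fixed input format for ALS. It is a tuple-like record whose fields have the types [Int, Int, Double], so when building the dataset, user names and item names must be replaced with numeric IDs.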
/**
 * A more compact class to represent a rating than Tuple3[Int, Int, Double].
 */
@Since("0.8.0")
case class Rating @Since("0.8.0") (
    @Since("0.8.0") user: Int,
    @Since("0.8.0") product: Int,
    @Since("0.8.0") rating: Double)
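In the data file, the first column is the user ID, the second column is the product ID, and the third column is the rating, which is a Double.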
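Because the user and product fields must be Int, string names need an ID mapping before they can become ratings. The following is a minimal sketch of one way to do that in plain Scala; `IdMappingSketch`, `indexNames`, and `toRatings` are hypothetical helpers, and the local `Rating` only mirrors the fields of Spark's class so the example runs without Spark.

```scala
object IdMappingSketch {
  // Local stand-in that mirrors the fields of Spark's Rating case class.
  final case class Rating(user: Int, product: Int, rating: Double)

  // Assign each distinct name an integer ID in order of first appearance.
  def indexNames(names: Seq[String]): Map[String, Int] =
    names.distinct.zipWithIndex.toMap

  // Convert (userName, itemName, score) triples into numeric Rating records.
  def toRatings(raw: Seq[(String, String, Double)]): Seq[Rating] = {
    val userIds = indexNames(raw.map(_._1))
    val itemIds = indexNames(raw.map(_._2))
    raw.map { case (u, i, r) => Rating(userIds(u), itemIds(i), r) }
  }
}
```

Keeping the two maps around matters in practice, since recommendations come back as numeric IDs and have to be translated back into names.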
3. Source code walkthrough for ALS on the test dataset
3.1 All fields of the ALS class
@Since("0.8.0")
class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
    private var implicitPrefs: Boolean, // whether to use the implicit-feedback ALS variant instead of the explicit-feedback one
    private var alpha: Double,          // used only by the implicit-feedback variant: governs the baseline confidence in observed preferences
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
3.2 The ALS.train method
/**
 * Train a matrix factorization model given an RDD of ratings given by users to some products,
 * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
 * product of two lower-rank matrices of a given rank (number of features). To solve for these
 * features, we run a given number of iterations of ALS. This is done using a level of
 * parallelism given by `blocks`.
 *
 * @param ratings RDD of (userID, productID, rating) pairs
 * @param rank number of features to use
 * @param iterations number of iterations of ALS (recommended: 10-20)
 * @param lambda regularization factor (recommended: 0.01)
 * @param blocks level of parallelism to split computation into
 * @param seed random seed
 */
@Since("0.9.1")
def train(
    ratings: RDD[Rating], // RDD of (user ID, product ID, rating) records
    rank: Int,            // number of latent factors in the model
    iterations: Int,      // number of ALS iterations
    lambda: Double,       // ALS regularization parameter
    blocks: Int,          // number of blocks used to parallelize the computation
    seed: Long
  ): MatrixFactorizationModel = {
  new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0, seed).run(ratings)
}
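To make "approximate the ratings matrix as the product of two lower-rank matrices" concrete, here is a toy sketch of alternating least squares for the special case rank = 1, where each half-step has a closed-form solution. It is not Spark's implementation: `Rank1AlsSketch`, `Obs`, `Model`, and `solve` are hypothetical names, the factors are single numbers rather than vectors, initialization is a fixed 1.0 instead of random, and nothing is distributed across blocks.

```scala
object Rank1AlsSketch {
  // Hypothetical stand-in for one (user, product, rating) observation.
  final case class Obs(user: Int, product: Int, rating: Double)

  // With rank 1, a predicted rating is the product of two scalar factors.
  final case class Model(userFactors: Map[Int, Double], productFactors: Map[Int, Double]) {
    def predict(user: Int, product: Int): Double =
      userFactors(user) * productFactors(product)
  }

  // One half-step: with the other side's factors held fixed, each factor is
  // the regularized least-squares solution
  //   x = sum(r * y) / (lambda + sum(y * y))
  // taken over the ratings that involve this ID.
  private def solve(
      ratings: Seq[Obs],
      idOf: Obs => Int,
      otherIdOf: Obs => Int,
      other: Map[Int, Double],
      lambda: Double): Map[Int, Double] =
    ratings.groupBy(idOf).map { case (id, obs) =>
      val num = obs.map(o => o.rating * other(otherIdOf(o))).sum
      val den = lambda + obs.map(o => math.pow(other(otherIdOf(o)), 2)).sum
      id -> num / den
    }

  // Alternate between solving for user factors and for product factors.
  def train(ratings: Seq[Obs], iterations: Int, lambda: Double): Model = {
    var products: Map[Int, Double] = ratings.map(_.product).distinct.map(_ -> 1.0).toMap
    var users: Map[Int, Double] = Map.empty
    for (_ <- 1 to iterations) {
      users = solve(ratings, _.user, _.product, products, lambda)
      products = solve(ratings, _.product, _.user, users, lambda)
    }
    Model(users, products)
  }
}
```

On ratings that are exactly rank 1, for example users with factors 1 and 2 and products with factors 2 and 3, a few iterations with a small lambda should reproduce the observed ratings closely. Higher ranks replace the scalar division with a small linear system per user or product.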
3.3 Collaborative filtering recommendation based on the ALS algorithm
package com.bigdata.demo

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

/**
 * Created by SimonsZhao on 3/30/2017.
 * ALS (alternating least squares)
 */
object CollaborativeFilter {
  def main(args: Array[String]): Unit = {
    // Configure the application
    val conf = new SparkConf().setMaster("local").setAppName("CollaborativeFilter")
    // Create the Spark context
    val sc = new SparkContext(conf)
    // Load the dataset
    val data = sc.textFile("E:/scala/spark/testdata/ALSTest.txt")
    // Convert each "user item rating" line into a Rating
    val ratings = data.map(_.split(' ') match {
      case Array(user, item, rate) =>
        Rating(user.toInt, item.toInt, rate.toDouble)
    })
    // Number of latent factors
    val rank = 2
    // Number of iterations
    val numIterations = 2
    // Train the model (lambda = 0.01)
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    // Recommend one product for user 2
    val rs = model.recommendProducts(2, 1)
    // Print the result
    rs.foreach(println)
    // Release Spark resources
    sc.stop()
  }
}
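The pattern match in the program above only handles lines with exactly three space-separated fields, so a blank or malformed line would fail at runtime. A more tolerant parser might look like the sketch below; `ParseSketch` and `parseLine` are hypothetical names, and the local `Rating` mirrors Spark's class only so the example runs without Spark.

```scala
import scala.util.Try

object ParseSketch {
  // Local stand-in that mirrors the fields of Spark's Rating case class.
  final case class Rating(user: Int, product: Int, rating: Double)

  // Parse one "user item rating" line; return None for malformed input
  // (wrong number of fields or non-numeric values).
  def parseLine(line: String): Option[Rating] =
    line.trim.split("\\s+") match {
      case Array(user, item, rate) =>
        Try(Rating(user.toInt, item.toInt, rate.toDouble)).toOption
      case _ => None
    }
}
```

With an RDD of lines, something like `data.flatMap(line => parseLine(line))` would then keep only the lines that parse, assuming `parseLine` is adapted to build Spark's own `Rating`.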
4. Testing and analysis
According to the output, product 15 was recommended to user 2, with a predicted rating of 3.99.
5. Source code for user-based recommendation (mllib)
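Key points from the doc comment:
The method recommends products to a user.
num is an upper bound on how many products are returned; fewer may come back.
Each result is a Rating object holding the given user ID, a product ID, and a "score" in the rating field.
Results are sorted by score in decreasing order, so the first one is the product predicted to be most strongly recommended to the user.
The score is an opaque value that only indicates how strongly the product is recommended.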
/**
 * Recommends products to a user.
 *
 * @param user the user to recommend products to
 * @param num how many products to return. The number returned may be less than this.
 * @return [[Rating]] objects, each of which contains the given user ID, a product ID, and a
 *  "score" in the rating field. Each represents one recommended product, and they are sorted
 *  by score, decreasing. The first returned is the one predicted to be most strongly
 *  recommended to the user. The score is an opaque value that indicates how strongly
 *  recommended the product is.
 */
@Since("1.1.0")
def recommendProducts(user: Int, num: Int): Array[Rating] =
  MatrixFactorizationModel.recommend(userFeatures.lookup(user).head, productFeatures, num)
    .map(t => Rating(user, t._1, t._2))
6. Source code for item-based recommendation (mllib)
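Key points from the doc comment:
The method recommends users to a product, that is, it returns the users most likely to be interested in the product.
num is an upper bound on how many users are returned; fewer may come back.
Each result is a Rating object holding a user ID, the given product ID, and a "score" in the rating field.
Results are sorted by score in decreasing order, so the first one is the user predicted to be most strongly recommended to the product.
The score is an opaque value that only indicates how strongly the user is recommended.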
/**
 * Recommends users to a product. That is, this returns users who are most likely to be
 * interested in a product.
 *
 * @param product the product to recommend users to
 * @param num how many users to return. The number returned may be less than this.
 * @return [[Rating]] objects, each of which contains a user ID, the given product ID, and a
 *  "score" in the rating field. Each represents one recommended user, and they are sorted
 *  by score, decreasing. The first returned is the one predicted to be most strongly
 *  recommended to the product. The score is an opaque value that indicates how strongly
 *  recommended the user is.
 */
@Since("1.1.0")
def recommendUsers(product: Int, num: Int): Array[Rating] =
  MatrixFactorizationModel.recommend(productFeatures.lookup(product).head, userFeatures, num)
    .map(t => Rating(t._1, product, t._2))
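Both recommendProducts and recommendUsers delegate to MatrixFactorizationModel.recommend, passing one feature vector and the feature collection of the opposite side. The body of that helper is not quoted here, so the sketch below rests on an assumption: that it scores each candidate by the dot product of the two feature vectors and keeps the num highest scores. `RecommendSketch` and `topN` are hypothetical names, and plain Maps stand in for the feature RDDs.

```scala
object RecommendSketch {
  // Score every candidate by the dot product of its feature vector with
  // `query`, then keep the `num` highest-scoring (id, score) pairs.
  def topN(
      query: Array[Double],
      candidates: Map[Int, Array[Double]],
      num: Int): Seq[(Int, Double)] =
    candidates.toSeq
      .map { case (id, features) =>
        id -> query.zip(features).map { case (q, f) => q * f }.sum
      }
      .sortBy { case (_, score) => -score }
      .take(num)
}
```

Under this assumption the two directions differ only in which side supplies the query vector: `topN(userFeatures(user), productFeatures, num)` ranks products for a user, while `topN(productFeatures(product), userFeatures, num)` ranks users for a product. Fewer than num results come back when there are fewer candidates, which matches the note in the doc comments.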