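Spark Implementation of ALS

1. ALS workflow:

Initialize the dataset and the Spark environment ----> split the data into a training set and a validation set -----> train the ALS model -----> validate the results -----> if the results are acceptable, recommend products directly; otherwise keep training the ALS model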
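The validation step in the flow above needs some error measure to decide whether to stop. The source does not say which one is used; a common choice is root-mean-square error (RMSE) over (predicted, actual) rating pairs. The following is a minimal pure-Scala sketch under that assumption. The names `rmse` and `goodEnough` and the threshold parameter are hypothetical and are not part of Spark.

```scala
object ValidationSketch {
  // Root-mean-square error over (predicted, actual) rating pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (predicted, actual) pair")
    val squaredErrors = pairs.map { case (predicted, actual) =>
      val diff = predicted - actual
      diff * diff
    }
    math.sqrt(squaredErrors.sum / pairs.size)
  }

  // Hypothetical acceptance rule: stop training once the error is small enough.
  def goodEnough(pairs: Seq[(Double, Double)], threshold: Double): Boolean =
    rmse(pairs) <= threshold
}
```

If `goodEnough` returns false, the workflow would go back to the training step, for example with more iterations or a different rank.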

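2. Meaning of the dataset

Rating is the fixed input format for ALS. It is a tuple-like type whose fields are [Int, Int, Double], so when building the dataset, user names and item names must be replaced with numeric IDs.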

/**
 * A more compact class to represent a rating than Tuple3[Int, Int, Double].
 */
@Since("0.8.0")
case class Rating @Since("0.8.0") (
    @Since("0.8.0") user: Int,
    @Since("0.8.0") product: Int,
    @Since("0.8.0") rating: Double)

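The data looks as follows: the first column is the user ID, the second column is the product ID, and the third column is the rating, of type Double.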
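Because Rating only accepts Int IDs, raw data keyed by names has to be indexed first. The sketch below shows one way to do that in plain Scala, assigning IDs in order of first appearance. The `Rating` class here is a local stand-in with the same fields as the mllib class, and `indexNames` and `toRatings` are hypothetical helper names.

```scala
object IdMappingSketch {
  // Local stand-in that mirrors the shape of mllib's Rating; not the Spark class itself.
  final case class Rating(user: Int, product: Int, rating: Double)

  // Assign each distinct name a dense Int ID in order of first appearance.
  def indexNames(names: Seq[String]): Map[String, Int] =
    names.distinct.zipWithIndex.toMap

  // Convert (userName, itemName, score) triples into numeric Rating records,
  // returning the two lookup tables so IDs can be mapped back to names later.
  def toRatings(
      raw: Seq[(String, String, Double)]
  ): (Seq[Rating], Map[String, Int], Map[String, Int]) = {
    val userIds = indexNames(raw.map(_._1))
    val itemIds = indexNames(raw.map(_._2))
    val ratings = raw.map { case (u, i, r) => Rating(userIds(u), itemIds(i), r) }
    (ratings, userIds, itemIds)
  }
}
```

Keeping the lookup tables around matters in practice, since the recommendations come back as numeric IDs and usually need to be translated back to names.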

3. Reading the ALS source code

3.1 All fields of the ALS class

@Since("0.8.0")
class ALS private (
    private var numUserBlocks: Int,
    private var numProductBlocks: Int,
    private var rank: Int,
    private var iterations: Int,
    private var lambda: Double,
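    private var implicitPrefs: Boolean,  // whether to use the implicit-feedback variant of ALS instead of the explicit-feedback one
    private var alpha: Double,    // implicit-feedback variant only: confidence parameter that weights observed preferences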
    private var seed: Long = System.nanoTime()
  ) extends Serializable with Logging {
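
To make the role of `alpha` more concrete: in the usual implicit-feedback formulation (Hu, Koren and Volinsky), each raw observation r is turned into a binary preference and a confidence weight of 1 + alpha * r. The sketch below only illustrates that formula. It is not Spark's internal code, and the object and function names are made up for this example.

```scala
object ImplicitFeedbackSketch {
  // Binary preference: 1.0 if any interaction was observed, otherwise 0.0.
  def preference(observation: Double): Double =
    if (observation > 0.0) 1.0 else 0.0

  // Confidence grows linearly with the observation, scaled by alpha.
  // With no observation the confidence stays at the baseline of 1.0.
  def confidence(observation: Double, alpha: Double): Double =
    1.0 + alpha * observation
}
```

A larger `alpha` therefore makes observed interactions count more heavily relative to unobserved ones. The parameter has no effect when `implicitPrefs` is false.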

3.2 The ALS.train method

/**
   * Train a matrix factorization model given an RDD of ratings given by users to some products,
   * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
   * product of two lower-rank matrices of a given rank (number of features). To solve for these
   * features, we run a given number of iterations of ALS. This is done using a level of
   * parallelism given by `blocks`.
   *
   * @param ratings    RDD of (userID, productID, rating) pairs
   * @param rank       number of features to use  
   * @param iterations number of iterations of ALS (recommended: 10-20)
   * @param lambda     regularization factor (recommended: 0.01)
   * @param blocks     level of parallelism to split computation into
   * @param seed       random seed
   */
  @Since("0.9.1")
  def train(
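      ratings: RDD[Rating], // RDD of ratings, each made of a user ID, a product ID and a rating
      rank: Int,    // number of latent factors in the model
      iterations: Int,  // number of ALS iterations
      lambda: Double,  // ALS regularization parameter
      blocks: Int,   // number of blocks used to parallelize the computation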
      seed: Long
    ): MatrixFactorizationModel = {
    new ALS(blocks, blocks, rank, iterations, lambda, false, 1.0, seed).run(ratings)
  }
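
The doc comment says the ratings matrix is approximated as the product of two lower-rank matrices. In practice that means each user and each product gets a feature vector of length `rank`, and a predicted rating is the dot product of the two. A minimal sketch of that idea in plain Scala follows. `FactorizationSketch` and `predict` are illustrative names and the vectors in the usage note are invented, not trained values.

```scala
object FactorizationSketch {
  // Predicted rating = dot product of a user's and a product's feature vectors.
  def predict(userFeatures: Array[Double], productFeatures: Array[Double]): Double = {
    require(
      userFeatures.length == productFeatures.length,
      "both vectors must have the same length (the rank)"
    )
    userFeatures.zip(productFeatures).map { case (u, p) => u * p }.sum
  }
}
```

For example, with rank 2, a user vector (1.0, 2.0) and a product vector (3.0, 0.5) give a predicted rating of 1.0 * 3.0 + 2.0 * 0.5.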

3.3 Collaborative filtering recommendation based on ALS

package com.bigdata.demo

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

/**
  * Created by SimonsZhao on 3/30/2017.
  * ALS (alternating least squares)
  */
object CollaborativeFilter {

  def main(args: Array[String]): Unit = {
    // Configure the application
    val conf = new SparkConf().setMaster("local").setAppName("CollaborativeFilter")
    // Create the Spark context
    val sc = new SparkContext(conf)
    // Load the dataset
    val data = sc.textFile("E:/scala/spark/testdata/ALSTest.txt")
    // Parse each space-separated line ("user item rating") into a Rating
    val ratings = data.map(_.split(' ') match {
      case Array(user, item, rate) =>
        Rating(user.toInt, item.toInt, rate.toDouble)
    })
    // Number of latent factors
    val rank = 2
    // Number of iterations
    val numIterations = 2
    // Train the model (0.01 is the regularization parameter lambda)
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    // Recommend one product for user 2
    val rs = model.recommendProducts(2, 1)
    // Print the result
    rs.foreach(println)
    // Release Spark resources
    sc.stop()
  }

}
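
One caveat about the parsing step in the program above: the pattern match throws a scala.MatchError when a line does not split into exactly three fields, and toInt / toDouble throw NumberFormatException on non-numeric input. A more forgiving parser could look like the sketch below. `parseLine` is a hypothetical helper, and `Rating` is again a local stand-in for the mllib class so the example runs without Spark.

```scala
import scala.util.Try

object RatingParserSketch {
  // Local stand-in with the same fields as mllib's Rating.
  final case class Rating(user: Int, product: Int, rating: Double)

  // Returns Some(Rating) for a well-formed "user item rating" line, None otherwise.
  def parseLine(line: String): Option[Rating] =
    line.trim.split("\\s+") match {
      case Array(user, item, rate) =>
        // Try turns a NumberFormatException into None instead of failing the job.
        Try(Rating(user.toInt, item.toInt, rate.toDouble)).toOption
      case _ => None
    }
}
```

In the Spark program, a parser of this shape would typically be applied with flatMap instead of map, so malformed lines are dropped rather than failing the job.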

4. Testing and analysis

According to the output, product 15 is recommended to user 2, with a predicted rating of 3.99.

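5. Source code for user-based recommendation (mllib)

Summary of the doc comment:

Recommends products to a user.

num is how many products to return. The number returned may be less than this.

The result is an array of [[Rating]] objects, each containing the given user ID, a product ID, and a "score" in the rating field. Each one represents a recommended product, and they are sorted by score in descending order, so the first one returned is the product most strongly recommended to the user. The score is an opaque value that indicates how strongly the product is recommended.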

/**
   * Recommends products to a user.
   *
   * @param user the user to recommend products to
   * @param num how many products to return. The number returned may be less than this.
   * @return [[Rating]] objects, each of which contains the given user ID, a product ID, and a
   *  "score" in the rating field. Each represents one recommended product, and they are sorted
   *  by score, decreasing. The first returned is the one predicted to be most strongly
   *  recommended to the user. The score is an opaque value that indicates how strongly
   *  recommended the product is.
   */
  @Since("1.1.0")
  def recommendProducts(user: Int, num: Int): Array[Rating] =
    MatrixFactorizationModel.recommend(userFeatures.lookup(user).head, productFeatures, num)
      .map(t => Rating(user, t._1, t._2))
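
Conceptually, the method above looks up the user's feature vector, scores every product against it, and keeps the `num` best. The following in-memory sketch illustrates that idea with dot-product scores. It is not Spark's distributed implementation, and `RecommendSketch`, `recommend` and its parameter names are made up for this example.

```scala
object RecommendSketch {
  // Score every product against one user's feature vector and keep the top `num`.
  def recommend(
      userVector: Array[Double],
      productVectors: Seq[(Int, Array[Double])],
      num: Int
  ): Seq[(Int, Double)] = {
    val scored = productVectors.map { case (productId, vector) =>
      val score = userVector.zip(vector).map { case (u, p) => u * p }.sum
      (productId, score)
    }
    // Sort by score, highest first, then truncate to at most `num` entries.
    scored.sortBy { case (_, score) => -score }.take(num)
  }
}
```

As in the real method, fewer than `num` results come back when there are fewer than `num` products to score.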

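6. Source code for item-based recommendation (mllib)

Summary of the doc comment:

Recommends users to a product. That is, it returns the users who are most likely to be interested in the product.

Each returned object contains a user ID, the given product ID, and a "score" in the rating field.

Each one represents a recommended user, and they are sorted by score in descending order.

The first one returned is the user most strongly recommended for the product.

The score is an opaque value that indicates how strongly the user is recommended.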

/**
   * Recommends users to a product. That is, this returns users who are most likely to be
   * interested in a product.
   *
   * @param product the product to recommend users to
   * @param num how many users to return. The number returned may be less than this.
   * @return [[Rating]] objects, each of which contains a user ID, the given product ID, and a
   *  "score" in the rating field. Each represents one recommended user, and they are sorted
   *  by score, decreasing. The first returned is the one predicted to be most strongly
   *  recommended to the product. The score is an opaque value that indicates how strongly
   *  recommended the user is.
   */
  @Since("1.1.0")
  def recommendUsers(product: Int, num: Int): Array[Rating] =
    MatrixFactorizationModel.recommend(productFeatures.lookup(product).head, userFeatures, num)
      .map(t => Rating(t._1, product, t._2))

posted on 2019-08-12 10:41 农夫三拳有點疼
