Spark ML in Practice: KMeans for User Segmentation and Recall-Based Recommendation
I. Requirements:
We have four tables recording e-commerce data: customers (named `customs` in the code below), orders, orderItems, and goods. The goal is to recommend to each segment (cluster) of customers the goods they are most interested in.
The relationships between the tables are as follows (the original ER diagram is omitted):
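Since the diagram did not survive, here is a reconstruction of the relationships judging from the columns the code selects and joins on below (an inference, not the original schema):

- orders.cust_id → customs.cust_id (one customer places many orders)
- orderItems.ord_id → orders.ord_id (one order contains many line items)
- orderItems.good_id → goods.good_id (each line item references one product)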
II. Approach:
- Obtain features: build a DataFrame whose columns characterize each customer's consumption behavior (e.g. age, membership level).
- Normalize features: assemble every column except the ID into a single `feature` vector column, then fit a scaler model on it.
- Determine K: for each candidate K (2, 3, 4, 5, ...), compute the corresponding SSD (sum of squared distances; see the formula after this list). SSD always shrinks as K grows, so take the smallest K at which the K-SSD curve levels off.
- Plot the K-SSD curve with JFreeChart and pick K manually.
- Train the KMeans model and produce the `prediction` column (the group id).
- After grouping, use DataFrame operations to fetch the top 30 goods bought by each group.
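For reference, the SSD computed per K in step 3 is the within-cluster sum of squared distances: for K clusters $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$,

$$\mathrm{SSD}(K) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

This is the quantity `computeCost` returns in the loop of step 3 below.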
III. Implementation:
1. Obtain features
Data cleaning:
- Text => numbers, via StringIndexer (see the toy sketch after this list)
- Use custom UDFs to bucket each numeric column into levels
- Assemble the useful columns
- Cast every column to DoubleType
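A quick toy sketch (hypothetical data, not from the original) of what StringIndexer does: it maps each distinct string to a double index, with the most frequent label getting 0.0.

```scala
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._ // for toDF on a local Seq

val toy = Seq("acme", "acme", "globex").toDF("company")
val indexed = new StringIndexer()
  .setInputCol("company")
  .setOutputCol("compId")
  .fit(toy)
  .transform(toy)
indexed.show()
// +-------+------+
// |company|compId|
// +-------+------+
// |   acme|   0.0|   <- most frequent label gets index 0.0
// |   acme|   0.0|
// | globex|   1.0|
// +-------+------+
```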
Helper defs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Read one table from MySQL as a DataFrame
def readMySQL(spark: SparkSession, tableName: String) = {
  val map: Map[String, String] = Map[String, String](
    "url" -> "jdbc:mysql://192.168.56.111:3306/myshops2",
    "driver" -> "com.mysql.jdbc.Driver",
    "user" -> "root",
    "password" -> "root",
    "dbtable" -> tableName
  )
  spark.read.format("jdbc").options(map).load()
}

// Bucket the membership score into 4 levels
val func_membership = udf { (score: Int) =>
  score match {
    case i if i < 100 => 1
    case i if i < 500 => 2
    case i if i < 1000 => 3
    case _ => 4
  }
}

// Compute age from the birth date embedded in the 18-digit ID number
// (positions 6-14 are yyyyMMdd), relative to `now` in "yyyy-MM-dd" form
val func_bir = udf { (idno: String, now: String) =>
  val year = idno.substring(6, 10).toInt
  val month = idno.substring(10, 12).toInt
  val day = idno.substring(12, 14).toInt
  val dts = now.split("-")
  val nowYear = dts(0).toInt
  val nowMonth = dts(1).toInt
  val nowDay = dts(2).toInt
  if (nowMonth > month) nowYear - year
  else if (nowMonth < month) nowYear - 1 - year
  else if (nowDay >= day) nowYear - year
  else nowYear - 1 - year
}

// Bucket age into 7 bands
val func_age = udf { (num: Int) =>
  num match {
    case n if n < 10 => 1
    case n if n < 18 => 2
    case n if n < 23 => 3
    case n if n < 35 => 4
    case n if n < 50 => 5
    case n if n < 70 => 6
    case _ => 7
  }
}

// Bucket the user score into 3 levels
val func_userscore = udf { (sc: Int) =>
  sc match {
    case s if s < 100 => 1
    case s if s < 500 => 2
    case _ => 3
  }
}

// Bucket the login count into 2 levels
val func_logincount = udf { (sc: Int) =>
  sc match {
    case s if s < 500 => 1
    case _ => 2
  }
}
```
The main method:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder().appName("db").master("local[*]").getOrCreate()
import spark.implicits._

// Customer table (only active users)
val featureDataTable = readMySQL(spark, "customs").filter("active!=0")
  .select("cust_id", "company", "province_id", "city_id", "district_id",
    "membership_level", "create_at", "last_login_time", "idno", "biz_point", "sex",
    "marital_status", "education_id", "login_count", "vocation", "post")
// Goods table
val goodTable = readMySQL(spark, "goods").select("good_id", "price")
// Orders table
val orderTable = readMySQL(spark, "orders").select("ord_id", "cust_id")
// Order detail table
val orddetailTable = readMySQL(spark, "orderItems").select("ord_id", "good_id", "buy_num")

// Turn company names into numbers with StringIndexer
val compIndex = new StringIndexer().setInputCol("company").setOutputCol("compId")

// How many orders each user placed
val tmp_bc = orderTable.groupBy("cust_id").agg(count($"ord_id").as("buycount"))
// How much money each user spent on the site
val tmp_pay = orderTable.join(orddetailTable, Seq("ord_id"), "inner")
  .join(goodTable, Seq("good_id"), "inner")
  .groupBy("cust_id")
  .agg(sum($"buy_num" * $"price").as("pay"))

val df = compIndex.fit(featureDataTable).transform(featureDataTable)
  .withColumn("mslevel", func_membership($"membership_level"))
  // An empty window (over()) acts like a groupBy aggregate but keeps every row
  .withColumn("min_reg_date", min($"create_at") over ())
  .withColumn("reg_date", datediff($"create_at", $"min_reg_date"))
  .withColumn("min_login_time", min("last_login_time") over ())
  // Why subtract the minimum time? Raw timestamps are huge numbers; smaller values converge faster
  .withColumn("lasttime", datediff($"last_login_time", $"min_login_time"))
  // current_date() yields a DateType column; format it into the "yyyy-MM-dd" string func_bir expects
  // (to wrap a plain constant as a Column, use lit())
  .withColumn("age", func_age(func_bir($"idno", date_format(current_date(), "yyyy-MM-dd"))))
  .withColumn("user_score", func_userscore($"biz_point"))
  .withColumn("logincount", func_logincount($"login_count"))
  // Right side: some users never bought or spent anything, hence the left joins
  .join(tmp_bc, Seq("cust_id"), "left")
  .join(tmp_pay, Seq("cust_id"), "left")
  .na.fill(0)
  // withColumn keeps the source columns, so drop the ones we no longer need
  // (with select you would instead list exactly what to keep)
  .drop("company", "membership_level", "create_at", "min_reg_date",
    "last_login_time", "min_login_time", "idno", "biz_point", "login_count")

// Cast every remaining column to Double
val columns = df.columns.map(f => col(f).cast(DoubleType))
val num_fmt = df.select(columns: _*)
```
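The `min(...).over()` trick above deserves a closer look. A minimal sketch on hypothetical data: the empty window attaches a global aggregate to every row instead of collapsing the rows the way groupBy would.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val toy = Seq(("a", "2020-01-01"), ("b", "2020-03-15")).toDF("id", "create_at")
toy.withColumn("min_date", min($"create_at") over ()) // same value on every row
   .withColumn("reg_date", datediff($"create_at", $"min_date"))
   .show()
// +---+----------+----------+--------+
// | id| create_at|  min_date|reg_date|
// +---+----------+----------+--------+
// |  a|2020-01-01|2020-01-01|       0|
// |  b|2020-03-15|2020-01-01|      74|
// +---+----------+----------+--------+
```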
2. Normalize features

```scala
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}

// Assemble every column except cust_id into a single vector column
val va = new VectorAssembler()
  .setInputCols(Array("province_id", "city_id", "district_id", "sex", "marital_status",
    "education_id", "vocation", "post", "compId", "mslevel", "reg_date", "lasttime",
    "age", "user_score", "logincount", "buycount", "pay"))
  .setOutputCol("orign_feature")
val ofdf = va.transform(num_fmt).select("cust_id", "orign_feature")

// Normalize the raw feature column
val mmScaler: MinMaxScaler = new MinMaxScaler()
  .setInputCol("orign_feature")
  .setOutputCol("feature")
val resdf = mmScaler.fit(ofdf)   // fit produces a MinMaxScalerModel
  .transform(ofdf)               // apply the model to ofdf
  .select("cust_id", "feature")  // everything normalized into one "feature" column
  .cache()
```
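For reference, MinMaxScaler rescales each feature dimension independently into [0, 1] by default:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the per-dimension extremes seen during fit (a constant column, where $x_{\max} = x_{\min}$, is mapped to 0.5).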
3. Determine K

```scala
import scala.collection.mutable.ListBuffer
import org.apache.spark.ml.clustering.KMeans
import org.jfree.ui.RefineryUtilities

// Run KMeans for each candidate K and record the total cost (SSD) for that K
val disList: ListBuffer[Double] = ListBuffer[Double]()
for (i <- 2 to 40) { // try K from 2 to 40
  val kms = new KMeans().setFeaturesCol("feature").setK(i)
  val model = kms.fit(resdf)
  // Why no transform here? We don't need the per-row predictions
  // (cust_id, feature, prediction) yet; we only need computeCost
  // to get the SSD for this K
  disList.append(model.computeCost(resdf))
}

// Plot K vs. SSD (LineGraph is a custom JFreeChart wrapper)
val chart = new LineGraph("app", "KMeans centroids vs. distance", disList)
chart.pack()
RefineryUtilities.centerFrameOnScreen(chart)
chart.setVisible(true)
```
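One caveat: `KMeansModel.computeCost` was deprecated in Spark 2.4 and removed in 3.0. A minimal sketch of two alternatives on newer versions, assuming the same `resdf` as above: `model.summary.trainingCost` gives the same SSD, and `ClusteringEvaluator` scores the clustering by silhouette (higher is better) instead.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("feature")
  .setPredictionCol("prediction")
  .setMetricName("silhouette")

for (k <- 2 to 40) {
  val model = new KMeans().setFeaturesCol("feature").setK(k).fit(resdf)
  val ssd = model.summary.trainingCost                 // Spark 3.x replacement for computeCost
  val sil = evaluator.evaluate(model.transform(resdf)) // silhouette in [-1, 1]
  println(s"k=$k ssd=$ssd silhouette=$sil")
}
```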
After roughly 15 minutes of CPU burn, the run produces the K-SSD curve (chart screenshot omitted):
4. Group users and query with DataFrames

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Run KMeans with the stable K read off the curve (40 here)
val kms = new KMeans().setFeaturesCol("feature").setK(40)
val user_group_tab = kms.fit(resdf)
  .transform(resdf)                           // yields cust_id, feature, prediction
  .drop("feature")
  .withColumnRenamed("prediction", "groups")  // left with cust_id, groups

// Top 30 goods bought by each group:
// group by (groups, good_id) and count orders, then rank within each group
// with row_number ordered by buy count descending
val rank = 30
val wnd = Window.partitionBy("groups").orderBy(desc("group_buy_count"))
val topGoods = user_group_tab
  .join(orderTable, Seq("cust_id"), "left")
  .join(orddetailTable, Seq("ord_id"), "left")
  .na.fill(0)
  .groupBy("groups", "good_id")
  .agg(count("ord_id").as("group_buy_count"))
  .withColumn("rank", row_number().over(wnd))
  .filter($"rank" <= rank)
topGoods.show(false)
```
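In production you would persist this table rather than just show it. A hedged sketch of writing the per-group top-30 list back to MySQL, reusing the connection settings from readMySQL (the target table name `group_top_goods` is an assumption, not from the original):

```scala
// Hypothetical sink table "group_top_goods"; the online service can then
// recall recommendations by group id
topGoods.write
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.56.111:3306/myshops2")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "root")
  .option("dbtable", "group_top_goods")
  .mode("overwrite")
  .save()
```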
Result (output screenshot omitted):