spark之RFM客户价值分群挖掘
一.RFM
RFM是一种用来衡量当前客户价值和潜在客户价值的重要工具和手段。
在面向客户制定运营策略、营销策略时,我们希望能够针对不同的客户推行不同的策略,实现精准化运营,以期获取最大的转化率。精准化运营的前提是客户关系管理,而客户关系管理的核心是客户分类。
通过客户分类,对客户群体进行细分,区别出低价值客户、高价值客户,对不同的客户群体开展不同的个性化服务,将有限的资源合理地分配给不同价值的客户,实现效益最大化。
在客户分类中,RFM模型是一个经典的客户分群模型,模型利用通用交易环节中最核心的三个维度:
1.最近消费(Recency)自上次购买以来的持续时间
2.消费频率(Frequency)购买总数
3.消费金额(Monetary)该客户花费的总金额
二.分群方法
接下来需要对统计后的rfm数据进行分隔,以划分和分析不同的客户价值,即对客户分群,有3种方案可以对客户分群:
1.根据经验,熟悉业务的人进行定义划分标准,需要不断修正
2.统计每列的分位数,根据分位数进行划分(spark中没有分位统计函数,可利用python)
3.利用聚类自动划分(r、f、m作为特征,可统计更多特征)
三.数据
以下是数据集[online_retail.csv]的属性描述(数据集来自https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)):
1.InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
2.StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
3.Description: Product (item) name. Nominal.
4.Quantity: The quantities of each product (item) per transaction. Numeric.
5.InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
6.UnitPrice: Unit price. Numeric, Product price per unit in sterling.
7.CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
8.Country: Country name. Nominal, the name of the country where each customer resides.
四.代码(spark3.0,java1.8)
详细代码见:RFM客户价值分群挖掘(https://github.com/jiangnanboy/spark_tutorial),本程序统计RFM后利用kmeans进行顾客价值自动聚类分群。
/** * @param session */ public static Dataset<Row> rfmStatistics(SparkSession session) { /** * 以下是数据集[online_retail.csv]的属性描述(数据集来自https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)): * * InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. * StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. * Description: Product (item) name. Nominal. * Quantity: The quantities of each product (item) per transaction. Numeric. * InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated. * UnitPrice: Unit price. Numeric, Product price per unit in sterling. * CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. * Country: Country name. Nominal, the name of the country where each customer resides. */ String path = PropertiesReader.get("intermediate_customer_value_csv"); Dataset<Row> dataset = session.read() .option("sep", ",") .option("header", "true") .option("inferSchema", "true") .csv(path); /**统计每列值的数量 * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ * |InvoiceNo_count |StockCode_count|Description_count|Quantity_count|InvoiceDate_count|UnitPrice_count|CustomerID_count|Country_count| * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ * | 541909| 541909| 540455| 541909| 541909| 541909| 406829| 541909| * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ */ //datasetColumnCount(dataset); //以上统计发现CustomerID有空值,进行过滤去除含有null/NAN的行,“any”为只要有缺失值就删除这一行 dataset = dataset.na().drop("any"); /**过滤空值后,统计每列值的数量 * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ * |InvoiceNo_count |StockCode_count|Description_count|Quantity_count|InvoiceDate_count|UnitPrice_count|CustomerID_count|Country_count| * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ * | 406829| 406829| 406829| 406829| 406829| 406829| 406829| 406829| * +----------------+---------------+-----------------+--------------+-----------------+---------------+----------------+-------------+ */ //datasetColumnCount(dataset); //将InvoiceDate列中的非标准日期转为标准格式 dataset = dataset.map((MapFunction<Row, Row>) row -> { StringBuilder sb = new StringBuilder(); String invoiceDate = row.getString(4); String[] tokens = invoiceDate.trim().split("\\s+"); // 2011/7/27 15:12 // year/month/day String yearMonthDay = tokens[0]; String[] yMD = yearMonthDay.split("\\/"); sb.append(yMD[0]).append("/"); if(1 == yMD[1].length()) { sb.append("0").append(yMD[1]).append("/"); } else { sb.append(yMD[1]).append("/"); } if(1 == yMD[2].length()) { sb.append("0").append(yMD[2]).append(" "); } else { sb.append(yMD[2]).append(" "); } // hour/min String hourMin = tokens[1]; String[] hm = hourMin.split(":"); if(1 == hm[0].length()) { sb.append("0").append(hm[0]).append(":"); } else { sb.append(hm[0]).append(":"); } if(1 == hm[1].length()) { sb.append("0").append(hm[1]); } else { sb.append(hm[1]); } return RowFactory.create(row.getString(0), row.getString(1), row.getString(2), row.getInt(3), sb.toString(), row.getDouble(5), row.getInt(6), row.getString(7)); }, RowEncoder.apply(InitSchema.initOnlineRetailSchema())); /** * 将InvoiceDate列转为时间戳,新增一列时间戳NewInvoiceDate * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+ * |InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|CustomerID| Country| NewInvoiceDate| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+ * | 536365| 85123A|WHITE HANGING HEA...| 6|2010/12/01 08:26| 2.55| 17850|United Kingdom|2010-12-01 08:26:00| * | 536365| 71053| WHITE METAL LANTERN| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| * | 536365| 84406B|CREAM CUPID HEART...| 8|2010/12/01 08:26| 2.75| 17850|United Kingdom|2010-12-01 08:26:00| * | 536365| 84029G|KNITTED UNION FLA...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| * | 536365| 84029E|RED WOOLLY HOTTIE...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+ */ //dataset = dataset.withColumn("NewInvoiceDate",functions.to_utc_timestamp(functions.unix_timestamp(col("InvoiceDate"), "yyyy/MM/dd HH:mm").cast("timestamp"), "UTC")); dataset = dataset.withColumn("NewInvoiceDate", functions.unix_timestamp(col("InvoiceDate"),"yyyy/MM/dd HH:mm").cast(DataTypes.TimestampType)); /** * 计算总额:=》 Quantity*UnitPrice作为新列TotalPrice * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+ * |InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|CustomerID| Country| NewInvoiceDate|TotalPrice| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+ * | 536365| 85123A|WHITE HANGING HEA...| 6|2010/12/01 08:26| 2.55| 17850|United Kingdom|2010-12-01 08:26:00| 15.3| * | 536365| 71053| WHITE METAL LANTERN| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| * | 536365| 84406B|CREAM CUPID HEART...| 8|2010/12/01 08:26| 2.75| 17850|United Kingdom|2010-12-01 08:26:00| 22.0| * | 536365| 84029G|KNITTED UNION FLA...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| * | 536365| 84029E|RED WOOLLY HOTTIE...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+ */ dataset = dataset.withColumn("TotalPrice", functions.round(col("Quantity").multiply(col("UnitPrice")), 2)); //获取NewInvoiceDate列中最大时间戳 List<Row> maxInvoiceRow = dataset.select(functions.max(col("NewInvoiceDate")).as("MaxInvoiceDate")).collectAsList(); Timestamp maxTimeStamp = maxInvoiceRow.get(0).getTimestamp(0); /** * 计算时间差=NewInvoiceDate列中最大时间 - 每列的时间,新增时间差列Duration为相差天数 * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+--------+ * |InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|CustomerID| Country| NewInvoiceDate|TotalPrice|Duration| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+--------+ * | 536365| 85123A|WHITE HANGING HEA...| 6|2010/12/01 08:26| 2.55| 17850|United Kingdom|2010-12-01 08:26:00| 15.3| 373| * | 536365| 71053| WHITE METAL LANTERN| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| 373| * | 536365| 84406B|CREAM CUPID HEART...| 8|2010/12/01 08:26| 2.75| 17850|United Kingdom|2010-12-01 08:26:00| 22.0| 373| * | 536365| 84029G|KNITTED UNION FLA...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| 373| * | 536365| 84029E|RED WOOLLY HOTTIE...| 6|2010/12/01 08:26| 3.39| 17850|United Kingdom|2010-12-01 08:26:00| 20.34| 373| * +---------+---------+--------------------+--------+----------------+---------+----------+--------------+-------------------+----------+--------+ */ dataset = dataset.withColumn("Duration", functions.datediff(functions.lit(maxTimeStamp), col("NewInvoiceDate"))); /** * 计算 RFM => Recency, Frequency, Monetary * 最近-客户最近购买了多少?即自上次购买以来的持续时间 * 频率——他们多久购买一次?即购买总数 * 货币价值——他们花了多少钱?即该客户花费的总金额 * * +----------+-------+ * |CustomerID|Recency| * +----------+-------+ * | 17420| 50| * | 16861| 59| * | 16503| 106| * | 15727| 16| * | 17389| 0| * +----------+-------+ * * +----------+---------+ * |CustomerID|Frequence| * +----------+---------+ * | 15619| 1| * | 17389| 43| * | 12940| 4| * | 13623| 7| * | 14450| 3| * +----------+---------+ * * +----------+--------+ * |CustomerID|Monetary| * +----------+--------+ * | 17420| 598.83| * | 16861| 151.65| * | 16503| 1421.43| * | 15727| 5178.96| * | 17389|31300.08| * +----------+--------+ * */ Dataset<Row> recencyDataset = dataset.groupBy("CustomerID").agg(functions.min(col("Duration")).as("Recency")); Dataset<Row> frequencyDataset = dataset.groupBy("CustomerID", "InvoiceNo").count().groupBy("CustomerID").agg(functions.count("*").as("Frequence")); Dataset<Row> monetaryDataset = dataset.groupBy("CustomerID").agg(functions.round(functions.sum("TotalPrice"), 2).as("Monetary")); /** * 连接recencyDataset、frequencyDataset、monetaryDataset,获得RFM的统计 * +----------+-------+---------+--------+ * |CustomerID|Recency|Frequence|Monetary| * +----------+-------+---------+--------+ * | 12940| 46| 4| 876.29| * | 13285| 23| 4| 2709.12| * | 13623| 30| 7| 672.44| * | 13832| 17| 2| 40.95| * | 14450| 180| 3| 483.25| * +----------+-------+---------+--------+ */ Dataset<Row> rfmDataset = recencyDataset.join(frequencyDataset, "CustomerID").join(monetaryDataset, "CustomerID"); return rfmDataset; }