A Spark Custom Partitioner Example
Data Preparation
Data format:
// video name    video site    play count    favorites    comments    dislikes    likes
川东游击队 3 2713 0 0 0 0
The number in the video-site column identifies the site: 1 = Youku, 2 = Sohu, 3 = Tudou, 4 = iQiyi, 5 = Xunlei Kankan.
Requirements
Put records for the same video site into the same partition, so that every metric of each show can later be totaled per site.
Steps
1. Define a custom partitioner class that extends Partitioner
class TVPlayPartitioner(numPars: Int) extends Partitioner {
  override def numPartitions: Int = numPars
  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) code + numPartitions else code
  }
  override def hashCode(): Int = numPartitions
  override def equals(obj: Any): Boolean = obj match {
    case tvplay: TVPlayPartitioner => tvplay.numPartitions == numPartitions
    case _ => false
  }
}
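The negative-modulo correction in getPartition matters because hashCode can be negative for arbitrary keys, and a negative partition index is invalid. A minimal, Spark-free sketch of that same logic (the object and helper names here are mine, not from the original code):

```scala
object PartitionDemo {
  // Same logic as getPartition: map any hashCode into [0, numPartitions)
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val code = hash % numPartitions
    if (code < 0) code + numPartitions else code
  }

  def main(args: Array[String]): Unit = {
    // Site codes 1..5 hash to themselves, so sites 1-4 keep their own index
    // and site 5 wraps around to partition 0
    println((1 to 5).map(k => nonNegativeMod(k.hashCode, 5)))  // Vector(1, 2, 3, 4, 0)
    // A negative hash would otherwise yield a negative index; -7 % 5 is -2 in Scala
    println(nonNegativeMod(-7, 5))  // 3
  }
}
```

Note that each of the five sites lands in its own partition, which is exactly what the example relies on.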
2. Read the test data file
val input = "/data/spark-example/tvplay/tvplay.txt"
val data = sc.textFile(input)
3. Use the map operator to extract each line's video site as the key
val rdd = data.map(line => (line.split("\t")(1).toInt, line))
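Each line is tab-delimited, so split("\t")(1) yields the site code as a string, and toInt turns it into the integer key. A quick Spark-free check of that extraction, using the sample line from the data above:

```scala
object KeyExtractDemo {
  def main(args: Array[String]): Unit = {
    // Sample row in the tutorial's format: name \t site \t metrics...
    val line = "川东游击队\t3\t2713\t0\t0\t0\t0"
    // Field 1 is the site code; it becomes the partitioning key
    val key = line.split("\t")(1).toInt
    println(key)  // 3
  }
}
```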
4. Use the partitionBy operator to repartition the RDD with TVPlayPartitioner; since there are 5 video sites, set the number of partitions to 5
val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
5. Verify that the partitioning is correct
newRDD.glom().collect
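glom() turns each partition into an array, so the collected result is one array per partition and you can see which keys ended up where. Without a cluster, the expected layout can be simulated with plain collections (the sample keys below are hypothetical, and the helper reproduces getPartition's logic):

```scala
object GlomSimulation {
  // Reproduces TVPlayPartitioner.getPartition for a plain Int hash
  def mod(hash: Int, n: Int): Int = {
    val c = hash % n
    if (c < 0) c + n else c
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 2, 3, 4, 5, 3, 1)
    // Group keys by the partition index TVPlayPartitioner would assign
    val byPartition = keys.groupBy(k => mod(k.hashCode, 5)).toSeq.sortBy(_._1)
    byPartition.foreach { case (p, ks) => println(s"partition $p -> ${ks.mkString(",")}") }
    // partition 0 -> 5
    // partition 1 -> 1,1
    // partition 2 -> 2
    // partition 3 -> 3,3
    // partition 4 -> 4
  }
}
```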
Complete Code
1.TVPlayPartitioner
import org.apache.spark.Partitioner
class TVPlayPartitioner(numPars: Int) extends Partitioner {
  override def numPartitions: Int = numPars

  // Map the key's hashCode into [0, numPartitions); correct negative results
  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) code + numPartitions else code
  }

  override def hashCode(): Int = numPartitions

  // Two TVPlayPartitioners are equal iff they have the same partition count
  override def equals(obj: Any): Boolean = obj match {
    case tvplay: TVPlayPartitioner => tvplay.numPartitions == numPartitions
    case _ => false
  }
}
2.TVPlayCount
import org.apache.spark.{SparkConf, SparkContext}
object TVPlayCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tvplay-count")
    val sc = new SparkContext(conf)
    val input = "/data/spark-example/tvplay/tvplay.txt"
    val data = sc.textFile(input)
    // Key each line by its video-site code (field 1 of the tab-delimited line)
    val rdd = data.map(line => (line.split("\t")(1).toInt, line))
    // Repartition: one partition per video site (5 sites)
    val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
    // Collect each partition as an array and print it to inspect the grouping
    newRDD.glom().collect().foreach(part => println(part.mkString("; ")))
    sc.stop()
  }
}
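The walkthrough stops at partitioning, but the stated goal is per-site totals of each metric. One possible sketch of that aggregation, using plain Scala collections with hypothetical sample rows (in a real Spark job the same element-wise sum would typically go into reduceByKey):

```scala
object SiteTotalsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical rows in the tutorial's format:
    // name \t site \t plays \t favorites \t comments \t dislikes \t likes
    val rows = Seq(
      "川东游击队\t3\t2713\t0\t0\t0\t0",
      "示例剧A\t3\t100\t1\t2\t0\t5",
      "示例剧B\t1\t50\t0\t0\t0\t1"
    )
    val totals = rows
      .map(_.split("\t"))
      .map(f => f(1).toInt -> f.drop(2).map(_.toLong))  // (site, five metric columns)
      .groupBy(_._1)
      .map { case (site, xs) =>
        // Element-wise sum of the five metric columns for this site
        site -> xs.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }
    totals.toSeq.sortBy(_._1).foreach { case (s, m) =>
      println(s"site $s: ${m.mkString(",")}")
    }
    // site 1: 50,0,0,0,1
    // site 3: 2813,1,2,0,5
  }
}
```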