Spark Custom Partitioner Example

Data Preparation

To download the dataset, click here

Data Format

// show name	video site	plays	favorites	comments	dislikes	likes
川东游击队	3	2713	0	0	0	0

The number in the video-site column identifies the website: 1 = Youku, 2 = Sohu, 3 = Tudou, 4 = iQiyi, 5 = Xunlei Kankan
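For readable output later on, the ID-to-name mapping can be kept in a small lookup table. A minimal sketch (the siteNames map and the romanized names are illustrative, not part of the dataset):

    // Hypothetical lookup table from site ID to site name.
    val siteNames: Map[Int, String] = Map(
      1 -> "Youku",
      2 -> "Sohu",
      3 -> "Tudou",
      4 -> "iQiyi",
      5 -> "Xunlei Kankan"
    )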

Requirement

Place records from the same video site into the same partition, so that the totals of each metric for every TV show can then be computed per site (a sketch of that aggregation follows).
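A minimal sketch of the per-site aggregation this enables, assuming data is the RDD of raw lines loaded in step 2 below; the tuple layout is illustrative, not part of the original example:

    // Key each record by (site, show) and sum the five metric columns.
    val totals = data
      .map { line =>
        val f = line.split("\t")
        ((f(1).toInt, f(0)), (f(2).toLong, f(3).toLong, f(4).toLong, f(5).toLong, f(6).toLong))
      }
      .reduceByKey { (a, b) =>
        (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4, a._5 + b._5)
      }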

Steps

1. Define a custom partitioner class that extends Partitioner:

    import org.apache.spark.Partitioner

    class TVPlayPartitioner(numPars: Int) extends Partitioner {

        // Total number of partitions.
        override def numPartitions: Int = numPars

        // Map a key (the site ID) to a partition index in [0, numPartitions).
        override def getPartition(key: Any): Int = {
            val code = key.hashCode % numPartitions
            if (code < 0) code + numPartitions else code
        }

        // Spark compares partitioners with equals() to decide whether two RDDs
        // are partitioned the same way (and a shuffle can be skipped), so
        // hashCode() and equals() are overridden consistently.
        override def hashCode(): Int = numPartitions

        override def equals(obj: Any): Boolean = obj match {
            case other: TVPlayPartitioner => other.numPartitions == numPartitions
            case _ => false
        }
    }
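With five partitions and Int keys (an Int's hashCode is its own value), each site ID maps to a fixed partition. A quick check, e.g. in spark-shell:

    val p = new TVPlayPartitioner(5)
    p.getPartition(1)  // 1 (Youku)
    p.getPartition(3)  // 3 (Tudou)
    p.getPartition(5)  // 0, since 5 % 5 == 0 (Xunlei Kankan)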
    

2. Read the test data file:

val input = "/data/spark-example/tvplay/tvplay.txt"
val data = sc.textFile(input)  

3. Use the map operator to extract each line's video site as the key:

val rdd = data.map(line => (line.split("\t")(1).toInt, line))
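This assumes every line is well formed. A slightly more defensive variant (a sketch, not part of the original example) skips lines without a numeric site field:

    // Drop malformed lines instead of failing the whole job on one bad record.
    val safeRdd = data.flatMap { line =>
        val fields = line.split("\t")
        if (fields.length >= 2 && fields(1).nonEmpty && fields(1).forall(_.isDigit))
            Some((fields(1).toInt, line))
        else
            None
    }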

4. Use the partitionBy operator to repartition the RDD with TVPlayPartitioner; since there are 5 video sites, the number of partitions is set to 5:

val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
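partitionBy shuffles the data, and the resulting RDD remembers its partitioner, which can be confirmed directly:

    newRDD.partitioner  // Some(TVPlayPartitioner@...), i.e. the custom partitioner is attached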

5. Verify that the partitioning is correct:

newRDD.glom().collect
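glom() turns each partition into an array, so the collected result shows which records landed together. To inspect per-partition record counts instead of full contents, one option (a sketch) is mapPartitionsWithIndex:

    // Emit (partition index, record count) for each partition.
    newRDD.mapPartitionsWithIndex { (idx, iter) =>
        Iterator((idx, iter.size))
    }.collect().foreach(println)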

Complete Code

1. TVPlayPartitioner

import org.apache.spark.Partitioner

class TVPlayPartitioner(numPars: Int) extends Partitioner {

    // Total number of partitions.
    override def numPartitions: Int = numPars

    // Map a key (the site ID) to a partition index in [0, numPartitions).
    override def getPartition(key: Any): Int = {
        val code = key.hashCode % numPartitions
        if (code < 0) code + numPartitions else code
    }

    // Spark compares partitioners with equals() to decide whether a shuffle
    // can be skipped, so hashCode() and equals() are overridden consistently.
    override def hashCode(): Int = numPartitions

    override def equals(obj: Any): Boolean = obj match {
        case other: TVPlayPartitioner => other.numPartitions == numPartitions
        case _ => false
    }
}

2. TVPlayCount

import org.apache.spark.{SparkConf, SparkContext}

object TVPlayCount {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("tvplay-count")
        val sc = new SparkContext(conf)
        val input = "/data/spark-example/tvplay/tvplay.txt"
        val data = sc.textFile(input)
        // Key each line by its site ID (the second tab-separated field).
        val rdd = data.map(line => (line.split("\t")(1).toInt, line))
        // Group all records of the same site into the same partition.
        val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
        // Print each partition's contents to verify the partitioning.
        newRDD.glom().collect().foreach(arr => println(arr.mkString(", ")))
        sc.stop()
    }
}