A Spark Custom Partitioner Example
Data Preparation
Data format:
// video name    video site    play count    favorites    comments    dislikes    likes
川东游击队 3 2713 0 0 0 0
The number in the video-site column identifies the site: 1 = Youku, 2 = Sohu, 3 = Tudou, 4 = iQiyi, 5 = Xunlei Kankan.
Requirements
Put records for the same video site into the same partition, so that every metric of each show can later be totaled per site.
Steps
1. Define a custom partitioner class that extends Partitioner
class TVPlayPartitioner(numPars: Int) extends Partitioner {
  override def numPartitions: Int = numPars
  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) code + numPartitions else code
  }
  override def hashCode(): Int = numPartitions
  override def equals(obj: Any): Boolean = obj match {
    case tvplay: TVPlayPartitioner => tvplay.numPartitions == numPartitions
    case _ => false
  }
}
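The negative-modulo correction in getPartition matters because hashCode can be negative for arbitrary keys, and a negative partition index is invalid. A minimal, Spark-free sketch of that same logic (the object and helper names here are mine, not from the original code):

```scala
object PartitionDemo {
  // Same logic as getPartition: map any hashCode into [0, numPartitions)
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val code = hash % numPartitions
    if (code < 0) code + numPartitions else code
  }

  def main(args: Array[String]): Unit = {
    // Site codes 1..5 hash to themselves, so sites 1-4 keep their own index
    // and site 5 wraps around to partition 0
    println((1 to 5).map(k => nonNegativeMod(k.hashCode, 5)))  // Vector(1, 2, 3, 4, 0)
    // A negative hash would otherwise yield a negative index; -7 % 5 is -2 in Scala
    println(nonNegativeMod(-7, 5))  // 3
  }
}
```

Note that each of the five sites lands in its own partition, which is exactly what the example relies on.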
2. Read the test data file
val input = "/data/spark-example/tvplay/tvplay.txt"
val data = sc.textFile(input)
3. Use the map operator to extract each line's video site as the key
val rdd = data.map(line => (line.split("\t")(1).toInt, line))
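Each line is tab-delimited, so split("\t")(1) yields the site code as a string, and toInt turns it into the integer key. A quick Spark-free check of that extraction, using the sample line from the data above:

```scala
object KeyExtractDemo {
  def main(args: Array[String]): Unit = {
    // Sample row in the tutorial's format: name \t site \t metrics...
    val line = "川东游击队\t3\t2713\t0\t0\t0\t0"
    // Field 1 is the site code; it becomes the partitioning key
    val key = line.split("\t")(1).toInt
    println(key)  // 3
  }
}
```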
4. Use the partitionBy operator to repartition the RDD with TVPlayPartitioner; since there are 5 video sites, set the number of partitions to 5
val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
5. Verify that the partitioning is correct
newRDD.glom().collect
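glom() turns each partition into an array, so the collected result is one array per partition and you can see which keys ended up where. Without a cluster, the expected layout can be simulated with plain collections (the sample keys below are hypothetical, and the helper reproduces getPartition's logic):

```scala
object GlomSimulation {
  // Reproduces TVPlayPartitioner.getPartition for a plain Int hash
  def mod(hash: Int, n: Int): Int = {
    val c = hash % n
    if (c < 0) c + n else c
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 2, 3, 4, 5, 3, 1)
    // Group keys by the partition index TVPlayPartitioner would assign
    val byPartition = keys.groupBy(k => mod(k.hashCode, 5)).toSeq.sortBy(_._1)
    byPartition.foreach { case (p, ks) => println(s"partition $p -> ${ks.mkString(",")}") }
    // partition 0 -> 5
    // partition 1 -> 1,1
    // partition 2 -> 2
    // partition 3 -> 3,3
    // partition 4 -> 4
  }
}
```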
Complete Code
1.TVPlayPartitioner
import org.apache.spark.Partitioner
class TVPlayPartitioner(numPars: Int) extends Partitioner {
  override def numPartitions: Int = numPars

  // Map the key's hashCode into [0, numPartitions); correct negative results
  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) code + numPartitions else code
  }

  override def hashCode(): Int = numPartitions

  // Two TVPlayPartitioners are equal iff they have the same partition count
  override def equals(obj: Any): Boolean = obj match {
    case tvplay: TVPlayPartitioner => tvplay.numPartitions == numPartitions
    case _ => false
  }
}
2.TVPlayCount
import org.apache.spark.{SparkConf, SparkContext}
object TVPlayCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tvplay-count")
    val sc = new SparkContext(conf)
    val input = "/data/spark-example/tvplay/tvplay.txt"
    val data = sc.textFile(input)
    // Key each line by its video-site code (field 1 of the tab-delimited line)
    val rdd = data.map(line => (line.split("\t")(1).toInt, line))
    // Repartition: one partition per video site (5 sites)
    val newRDD = rdd.partitionBy(new TVPlayPartitioner(5))
    // Collect each partition as an array and print it to inspect the grouping
    newRDD.glom().collect().foreach(part => println(part.mkString("; ")))
    sc.stop()
  }
}
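The walkthrough stops at partitioning, but the stated goal is per-site totals of each metric. One possible sketch of that aggregation, using plain Scala collections with hypothetical sample rows (in a real Spark job the same element-wise sum would typically go into reduceByKey):

```scala
object SiteTotalsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical rows in the tutorial's format:
    // name \t site \t plays \t favorites \t comments \t dislikes \t likes
    val rows = Seq(
      "川东游击队\t3\t2713\t0\t0\t0\t0",
      "示例剧A\t3\t100\t1\t2\t0\t5",
      "示例剧B\t1\t50\t0\t0\t0\t1"
    )
    val totals = rows
      .map(_.split("\t"))
      .map(f => f(1).toInt -> f.drop(2).map(_.toLong))  // (site, five metric columns)
      .groupBy(_._1)
      .map { case (site, xs) =>
        // Element-wise sum of the five metric columns for this site
        site -> xs.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }
    totals.toSeq.sortBy(_._1).foreach { case (s, m) =>
      println(s"site $s: ${m.mkString(",")}")
    }
    // site 1: 50,0,0,0,1
    // site 3: 2813,1,2,0,5
  }
}
```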