Real-Time Big Data Processing with Spark: Data Analysis with Structured Streaming (Part 1)

Data Cleansing

  • Add the Jedis dependency
  • jedis 2.9.3
  • C:\Users\jieqiong\IdeaProjects\log-time\log-sss\pom.xml
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.9.3</version>
</dependency>
  • Add the ip2region dependency
  • C:\Users\jieqiong\IdeaProjects\log-time\log-sss\pom.xml
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>ip2region</artifactId>
    <version>1.7.2</version>
</dependency>
  • Put ip2region.db at: C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\resources\ip2region.db
  • Create C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\IPUtils.java
  • Call parseIP through IPUtils
  • Test: IP 210.51.167.169 resolves to Beijing
package com.imooc.spark.sss.project;

import org.lionsoul.ip2region.DataBlock;
import org.lionsoul.ip2region.DbConfig;
import org.lionsoul.ip2region.DbSearcher;

import java.io.IOException;

public class IPUtils {

    public static String parseIP(String ip) {
        String result = "";
        // ip2region.db is read from the classpath (src/main/resources)
        String dbFile = IPUtils.class.getClassLoader().getResource("ip2region.db").getPath();
        DbSearcher search = null;
        try {
            search = new DbSearcher(new DbConfig(), dbFile);
            DataBlock dataBlock = search.btreeSearch(ip);
            // region comes back as "country|area|province|city|ISP",
            // e.g. "中国|0|北京|北京市|联通"; index 2 is the province
            String region = dataBlock.getRegion();
            String replace = region.replace("|", ",");
            String[] splits = replace.split(",");
            if (splits.length == 5) {
                result = splits[2];
            }
            return result;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (search != null) search.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String detail = IPUtils.parseIP("210.51.167.169");
        System.out.println(detail);
    }
}
  • Create: C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\SSSApp.scala
  • Check the offsets of zhang-replicated-topic (GetOffsetShell with --time -1 prints the latest offset)
  • After testing, data is produced as expected

[hadoop@spark000 bin]$ pwd
/home/hadoop/app/kafka_2.12-2.5.0/bin
[hadoop@spark000 bin]$ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 127.0.0.1:9092 --topic zhang-replicated-topic --time -1
zhang-replicated-topic:0:20230
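
The same latest-offset check can be done programmatically. Below is a minimal sketch using the kafka-clients KafkaConsumer API (assuming kafka-clients is already on the classpath through the Spark Kafka connector); the OffsetCheck object is a hypothetical helper, not part of the project above.

import java.util.{Arrays, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object OffsetCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "spark000:9092")
    // deserializers are required by the constructor even though we never poll
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("zhang-replicated-topic", 0)
    // endOffsets returns the same number GetOffsetShell prints with --time -1
    val end = consumer.endOffsets(Arrays.asList(tp))
    println(s"${tp.topic}:${tp.partition}:${end.get(tp)}")
    consumer.close()
  }
}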

 

package com.imooc.spark.sss.project

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object SSSApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      //.config("spark.sql.shuffle.partitions", "10")
      .appName(this.getClass.getName)
      .getOrCreate()

    // TODO... resume from previously saved offsets

    import spark.implicits._
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092,spark000:9093,spark000:9094")
      .option("subscribe", "zhang-replicated-topic")
      .option("startingOffsets", """{"zhang-replicated-topic":{"0":15000}}""")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String].map(x => {
        val splits = x.split("\t")
        val time = splits(0)
        val ip = splits(2)

        (new Timestamp(time.toLong), DateUtils.parseToDay(time), IPUtils.parseIP(ip))
      }).toDF("ts", "day", "province")
      .withWatermark("ts", "10 minutes")
      .writeStream
      .format("console") // results go to the console for now ==> Redis later
      .outputMode(OutputMode.Update())
      .start()
      .awaitTermination()
  }
}
  • Create: C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\DateUtils.scala
package com.imooc.spark.sss.project

import org.apache.commons.lang3.time.FastDateFormat

object DateUtils {

  // formats epoch millis such as 1597039092628 into yyyyMMdd;
  // FastDateFormat is thread-safe, unlike SimpleDateFormat
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMdd")

  def parseToDay(time: String) = {
    TARGET_FORMAT.format(time.toLong)
  }

  def main(args: Array[String]): Unit = {
    println(parseToDay("1597039092628"))
  }
}

 

Redis Quick Start

  • Redis 6 needs a reasonably new GCC to compile, so build gcc 9.3.0 first:

  [root@spark000 tmp]# pwd
  /opt/tmp
  [root@spark000 tmp]# ls
  gcc-9.3.0.tar.gz
  [root@spark000 tmp]# tar -zxvf gcc-9.3.0.tar.gz

  • Download Redis 6.0.6 into the software directory and edit its configuration file
-rw-rw-r-- 1 hadoop hadoop   2228781 Mar 23 14:30 redis-6.0.6.tar.gz
[hadoop@spark000 software]$ tar -zxvf redis-6.0.6.tar.gz
[hadoop@spark000 redis-6.0.6]$ make distclean
[hadoop@spark000 redis-6.0.6]$ sudo make
[hadoop@spark000 redis-6.0.6]$ sudo make install
[hadoop@spark000 redis-6.0.6]$ vi redis.conf
daemonize no ===> change to ===> daemonize yes
pidfile /var/run/redis_6379.pid   // (the default port is 6379)
databases 16                      // (16 logical databases are created by default)
  • Start Redis
[hadoop@spark000 redis-6.0.6]$ src/redis-server redis.conf   // (start the server)
[hadoop@spark000 redis-6.0.6]$ ps -ef|grep redis             // (check it is running)
[hadoop@spark000 redis-6.0.6]$ src/redis-cli                 // (connect)
  • Basic Redis commands
  • hash: worth mastering
  • A hash key can hold multiple fields, each with its own value
127.0.0.1:6379> keys *          // list all keys in the current database
(empty array)
127.0.0.1:6379> select 1       // switch to the second database
OK
127.0.0.1:6379[1]> select 0    // switch back to the first database
OK
127.0.0.1:6379> set user1 jieqiong     // set a key/value pair
OK
127.0.0.1:6379> set user2 jieqiong
OK
127.0.0.1:6379> set user3 jieqiong
OK
127.0.0.1:6379> get user1              // get the value for a key
"jieqiong"
127.0.0.1:6379> keys user[1-4]         // list keys matching a pattern
1) "user3"
2) "user2"
3) "user1"
127.0.0.1:6379> exists user            // check whether a key exists: 1 if yes, 0 if no
(integer) 0
127.0.0.1:6379> exists user1
(integer) 1
127.0.0.1:6379> exists user19
(integer) 0
127.0.0.1:6379> keys *                // list all keys in the current database
1) "user3"
2) "user2"
3) "user1"
127.0.0.1:6379> del user1 user2 user3   // delete multiple keys
(integer) 3
127.0.0.1:6379> keys *
(empty array)
127.0.0.1:6379> help                   // help
redis-cli 6.0.6
To get help about Redis commands type:
      "help @<group>" to get a list of commands in <group>
      "help <command>" for help on <command>
      "help <tab>" to get a list of possible help topics
      "quit" to exit

To set redis-cli preferences:
      ":set hints" enable online hints
      ":set nohints" disable online hints
Set your preferences in ~/.redisclirc
127.0.0.1:6379> help @string          // list all string commands
127.0.0.1:6379> help @hash            // list all hash commands

127.0.0.1:6379> hset user:100 name pk
(integer) 1
127.0.0.1:6379> hset user:100 age 30
(integer) 1
127.0.0.1:6379> hget user:100 name
"pk"
127.0.0.1:6379> hget user:100 age
"30"
127.0.0.1:6379> hmset user:100 age 29 gender male
OK
127.0.0.1:6379> hmget user:100 age gender
1) "29"
2) "male"
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "29"
5) "gender"
6) "male"
127.0.0.1:6379> hincrby user:100 age 2
(integer) 31
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "31"
5) "gender"
6) "male"
127.0.0.1:6379> hlen user:100
(integer) 3
127.0.0.1:6379> hvals user:100
1) "pk"
2) "31"
3) "male"
127.0.0.1:6379> hkeys user:100
1) "name"
2) "age"
3) "gender"
127.0.0.1:6379> hdel user:100 gender
(integer) 1
127.0.0.1:6379> hlen user:100
(integer) 2
127.0.0.1:6379> hvals user:100
1) "pk"
2) "31"
127.0.0.1:6379> hkeys user:100
1) "name"
2) "age"
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "31"
127.0.0.1:6379>


 

Operating Redis via the Jedis API

  • Test 1: write data via the API, then query it in redis-cli
package com.imooc.spark.sss.project

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum number of connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // block while the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // how long to block, in milliseconds
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow
  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = {
    jedisPool.getResource
  }

  def main(args: Array[String]): Unit = {
    getJedisClient.set("stu", "pk")
  }
}
127.0.0.1:6379> get stu
"pk"
  • Test 2: store hash data via the API, then query it in redis-cli
package com.imooc.spark.sss.project

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum number of connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // block while the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // how long to block, in milliseconds
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow
  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = {
    jedisPool.getResource
  }

  def main(args: Array[String]): Unit = {
    getJedisClient.hset("imooc-user-100", "name", "pk")
    getJedisClient.hset("imooc-user-100", "age", "31")
    getJedisClient.hset("imooc-user-100", "gender", "m")
  }
}
127.0.0.1:6379> keys *
1) "user1"
2) "stu"
3) "imooc-user-100"
4) "user:100"
127.0.0.1:6379> hgetall imooc-user-100
1) "name"
2) "pk"
3) "age"
4) "31"
5) "gender"
6) "m"
  • Test 3: read the data back via the API and check it in the IDEA console
package com.imooc.spark.sss.project

import java.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum number of connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // block while the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // how long to block, in milliseconds
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow
  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = {
    jedisPool.getResource
  }

  def main(args: Array[String]): Unit = {
    import scala.collection.JavaConversions._
    val result: util.Map[String, String] = getJedisClient.hgetAll("imooc-user-100")
    for ((k, v) <- result) {
      println(k + "-->" + v)
    }
  }
}
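A side note: scala.collection.JavaConversions is deprecated in Scala 2.12. A sketch of the same loop with the explicit JavaConverters API, which is a drop-in replacement here:

import scala.collection.JavaConverters._

val result = RedisUtils.getJedisClient.hgetAll("imooc-user-100").asScala
result.foreach { case (k, v) => println(s"$k-->$v") }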

 

Sinking the Aggregated Results to Redis

  • C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\SSSApp.scala
  • Start, in order: dfs, yarn, log-web.jar, access-kafka.conf, zookeeper, kafka, the consumer, and the master
  • Tested successfully; the data is written to Redis
package com.imooc.spark.sss.project

import java.sql.Timestamp

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import redis.clients.jedis.Jedis

object SSSApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      //.config("spark.sql.shuffle.partitions", "10")
      .appName(this.getClass.getName)
      .getOrCreate()

    // TODO... resume from previously saved offsets

    import spark.implicits._
    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092,spark000:9093,spark000:9094")
      .option("subscribe", "zhang-replicated-topic")
      //.option("startingOffsets", """{"zhang-replicated-topic":{"0":15000}}""")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String].map(x => {
        val splits = x.split("\t")
        val time = splits(0)
        val ip = splits(2)

        (new Timestamp(time.toLong), DateUtils.parseToDay(time), IPUtils.parseIP(ip))
      }).toDF("ts", "day", "province")
      .withWatermark("ts", "10 minutes")
      .groupBy("day", "province")
      .count()
      .writeStream
      //.format("console") // the console sink was for debugging ==> now sink to Redis
      .outputMode(OutputMode.Update())
      .foreach(new ForeachWriter[Row] {
        var client: Jedis = _

        override def open(partitionId: Long, epochId: Long): Boolean = {
          client = RedisUtils.getJedisClient
          client != null
        }

        override def process(value: Row): Unit = {
          val day = value.getString(0)
          val province = value.getString(1)
          val cnts = value.getLong(2)
          //val offset = value.getAs[String]("offset")
          //client.set("", "")
          // one hash per day; field = province, value = running count
          client.hset("day-province-cnts-" + day, province, cnts + "")
        }

        override def close(errorOrNull: Throwable): Unit = {
          if (null != client) {
            RedisUtils.returnResource(client)
          }
        }
      })
      //.option("checkpointLocation", "./chk")
      .start()
      .awaitTermination()
  }
}
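The TODO above (resuming from saved offsets) is not implemented in this post. As a minimal sketch of one way to do it with the pieces already in place: persist the last processed offset per partition in a Redis hash, and rebuild the startingOffsets JSON from it on startup. The OffsetStore object and its key name are hypothetical, and the rows would need to keep Kafka's partition and offset columns for save to be callable:

package com.imooc.spark.sss.project

import redis.clients.jedis.Jedis

object OffsetStore {

  private val KEY = "sss-offsets-zhang-replicated-topic"

  // Persist the highest offset seen for a partition; this could be called from
  // the ForeachWriter if the query also selects Kafka's partition/offset columns.
  def save(client: Jedis, partition: Int, offset: Long): Unit = {
    client.hset(KEY, partition.toString, offset.toString)
  }

  // Rebuild the JSON that the "startingOffsets" option expects, e.g.
  // {"zhang-replicated-topic":{"0":15000}}; fall back to "latest" if nothing saved.
  def buildStartingOffsets(client: Jedis): String = {
    import scala.collection.JavaConversions._
    val saved = client.hgetAll(KEY)
    if (saved.isEmpty) "latest"
    else {
      // +1 so the query resumes after the last record that was processed
      val parts = saved.map { case (p, o) => s""""$p":${o.toLong + 1}""" }.mkString(",")
      s"""{"zhang-replicated-topic":{$parts}}"""
    }
  }
}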
  •  C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\RedisUtils.scala
package com.imooc.spark.sss.project

import java.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum number of connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // block while the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // how long to block, in milliseconds
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow
  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = {
    jedisPool.getResource
  }

  // returnResource is deprecated in later Jedis releases; jedis.close()
  // also returns a pooled connection
  def returnResource(jedis: Jedis): Unit = {
    jedisPool.returnResource(jedis)
  }

  def main(args: Array[String]): Unit = {
    import scala.collection.JavaConversions._
    val result: util.Map[String, String] = getJedisClient.hgetAll("imooc-user-100")
    for ((k, v) <- result) {
      println(k + "-->" + v)
    }
  }
}
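A small usage note: with a pool, the borrow/return pair is best wrapped in try/finally so a failed write still returns the connection. A minimal sketch against the RedisUtils above (the key and field here are just illustrative values):

val client = RedisUtils.getJedisClient
try {
  client.hset("day-province-cnts-20200810", "Beijing", "100")
} finally {
  RedisUtils.returnResource(client)
}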

 

Running on the Server

  • Put the required jars into the directory below:
  • ip2region-1.7.2.jar, ip2region.db, jedis-2.9.3.jar, log-sss-1.0.jar
[hadoop@spark000 lib]$ pwd
/home/hadoop/lib
[hadoop@spark000 lib]$ ll
total 31984
-rw-rw-r-- 1 hadoop hadoop    16732 Apr  6 16:05 ip2region-1.7.2.jar
-rw-rw-r-- 1 hadoop hadoop  8397900 Apr  6 15:45 ip2region.db
-rw-rw-r-- 1 hadoop hadoop   563737 Apr  6 16:05 jedis-2.9.3.jar
-rw-rw-r-- 1 hadoop hadoop     8077 Mar 29 09:57 log-generator-1.0.jar
-rw-rw-r-- 1 hadoop hadoop  3836855 Apr  6 15:39 log-sss-1.0.jar
-rw-rw-r-- 1 hadoop hadoop 19834079 Mar 29 09:59 log-web-0.0.1-SNAPSHOT.jar
-rw------- 1 hadoop hadoop    78701 Apr  6 10:34 nohup.out
  • Submit script
  • The job runs successfully, but the result differs from the local test (see the note after the script)
spark-submit \
--master yarn \
--name SSSApp \
--class com.imooc.spark.sss.project.SSSApp \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 \
--jars /home/hadoop/lib/jedis-2.9.3.jar,/home/hadoop/lib/ip2region-1.7.2.jar \
--files /home/hadoop/lib/ip2region.db \
/home/hadoop/lib/log-sss-1.0.jar
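
One thing worth checking when the cluster result differs from the local run (this is an assumption, not a confirmed diagnosis): locally, ip2region.db is read from the classpath under src/main/resources, but on YARN the --files option ships it into each container's working directory, where getResource may not find it. A hypothetical resolver that tries both locations:

package com.imooc.spark.sss.project

import java.io.File

import org.apache.spark.SparkFiles

object Ip2RegionPath {

  // Try the classpath first (local runs), then the copy distributed
  // by spark-submit --files (cluster runs).
  def resolve(): String = {
    val url = getClass.getClassLoader.getResource("ip2region.db")
    if (url != null) {
      url.getPath
    } else {
      val shipped = SparkFiles.get("ip2region.db")
      if (new File(shipped).exists()) shipped
      else sys.error("ip2region.db not found on the classpath or via --files")
    }
  }
}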

 
