Spark Real-Time Big Data Processing -- Data Analysis with Structured Streaming, Part 1
Data Cleansing
- Add dependencies
- jedis 2.9.3
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\pom.xml
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.3</version>
</dependency>
- Add the ip2region dependency
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\pom.xml
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>ip2region</artifactId>
<version>1.7.2</version>
</dependency>
- Place ip2region.db at: C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\resources\ip2region.db
- Create C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\IPUtils.java
- parseIP is called through IPUtils
- Test: IP 210.51.167.169 resolves to Beijing (北京)
package com.imooc.spark.sss.project;

import org.lionsoul.ip2region.DataBlock;
import org.lionsoul.ip2region.DbConfig;
import org.lionsoul.ip2region.DbSearcher;

import java.io.FileNotFoundException;
import java.io.IOException;

public class IPUtils {

    public static String parseIP(String ip) {
        String result = "";
        String dbFile = IPUtils.class.getClassLoader().getResource("ip2region.db").getPath();
        DbSearcher search = null;
        try {
            search = new DbSearcher(new DbConfig(), dbFile);
            DataBlock dataBlock = search.btreeSearch(ip);
            // region is a pipe-separated string such as 中国|0|北京|北京市|联通
            String region = dataBlock.getRegion();
            String replace = region.replace("|", ",");
            String[] splits = replace.split(",");
            if (splits.length == 5) {
                result = splits[2]; // keep only the province field
            }
            return result;
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (search != null) search.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String detail = IPUtils.parseIP("210.51.167.169");
        System.out.println(detail);
    }
}
- 新建:C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\SSSApp.scala
- Check the current offset of zhang-replicated-topic (GetOffsetShell with --time -1 prints the latest offset as topic:partition:offset)
- After the test run, data shows up as expected
[hadoop@spark000 bin]$ pwd
/home/hadoop/app/kafka_2.12-2.5.0/bin
[hadoop@spark000 bin]$ kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 127.0.0.1:9092 --topic zhang-replicated-topic --time -1
zhang-replicated-topic:0:20230
package com.imooc.spark.sss.project
import java.sql.Timestamp
import com.imooc.spark.sss.SourceApp.eventTimeWindow
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import redis.clients.jedis.Jedis
object SSSApp {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[2]")
// .config("spark.sql.shuffle.partitions","10")
.appName(this.getClass.getName)
.getOrCreate()
// TODO... read the starting offsets from previously saved state
import spark.implicits._
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "spark000:9092,spark000:9093,spark000:9094")
.option("subscribe", "zhang-replicated-topic")
.option("startingOffsets", """{"zhang-replicated-topic":{"0":15000}}""")
.load()
.selectExpr("CAST(value AS STRING)")
.as[String].map(x => {
val splits = x.split("\t")
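// assumed log layout (tab-separated): field 0 = event time in epoch millis, field 2 = client IP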
val time = splits(0)
val ip = splits(2)
(new Timestamp(time.toLong), DateUtils.parseToDay(time), IPUtils.parseIP(ip))
}).toDF("ts","day","province")
.withWatermark("ts","10 minutes")
.writeStream
.format("console") // 这的console操作是结果显示在控制台 ==> Redis
.outputMode(OutputMode.Update())
.start()
.awaitTermination()
}
}
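One way to address the TODO above without hand-managing offsets is to give the query a checkpoint directory: Structured Streaming then records the consumed Kafka offsets in the checkpoint and resumes from them on restart, so startingOffsets only applies to the very first run. A minimal sketch under assumed paths (./chk, ./out) and a file sink (the console sink does not support recovery from a checkpoint):

package com.imooc.spark.sss.project

import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("CheckpointSketch").getOrCreate()

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092")
      .option("subscribe", "zhang-replicated-topic")
      .option("startingOffsets", "earliest") // used only when no checkpoint exists yet
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("json")                        // a fault-tolerant sink, unlike console
      .option("path", "./out")               // output directory (assumption)
      .option("checkpointLocation", "./chk") // offsets and state are recorded here
      .start()
      .awaitTermination()
  }
}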
- Create: C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\DateUtils.scala
package com.imooc.spark.sss.project

import org.apache.commons.lang3.time.FastDateFormat

object DateUtils {

  // sample input: 1597039092628 (epoch millis)
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMdd")

  def parseToDay(time: String) = {
    TARGET_FORMAT.format(time.toLong)
  }

  def main(args: Array[String]): Unit = {
    println(parseToDay("1597039092628"))
  }
}
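Note that FastDateFormat.getInstance("yyyyMMdd") formats with the JVM's default time zone, so the computed day can shift if the job runs on a machine in a different zone. A hedged variant that pins the zone explicitly (choosing Asia/Shanghai is an assumption):

package com.imooc.spark.sss.project

import java.util.TimeZone

import org.apache.commons.lang3.time.FastDateFormat

object DateUtilsZoned {

  // pin the formatter to a fixed zone so the day boundary does not depend on the JVM default
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMdd", TimeZone.getTimeZone("Asia/Shanghai"))

  def parseToDay(time: String): String = TARGET_FORMAT.format(time.toLong)

  def main(args: Array[String]): Unit = {
    println(parseToDay("1597039092628")) // 20200810 in UTC+8
  }
}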
Redis Quick Start
- Database
- An in-memory key-value database
- Built to handle massive read/write request volumes
- key: a string
- value: one of several data types (the Structured Streaming results here use the hash type; others are string, list, set, sorted set, ...)
- Features: speed, persistence, multiple data types, clients for many programming languages
- Upgrade gcc first (Redis 6 fails to build with the stock CentOS 7 gcc); combining the steps from these two articles works:
- gcc升级后为啥还是旧版本_Cent OS 7.6的gcc升级的操作方法 (CSDN blog)
- Centos7安装redis报错make[1]: *** [server.o] Error 1 (CSDN blog)
[root@spark000 tmp]# pwd
/opt/tmp
[root@spark000 tmp]# ls
gcc-9.3.0.tar.gz
[root@spark000 tmp]# tar -zxvf gcc-9.3.0.tar.gz
- Download Redis 6.0.6 into software and edit the config file
-rw-rw-r-- 1 hadoop hadoop 2228781 Mar 23 14:30 redis-6.0.6.tar.gz
[hadoop@spark000 software]$ tar -zxvf redis-6.0.6.tar.gz
[hadoop@spark000 redis-6.0.6]$ make distclean
[hadoop@spark000 redis-6.0.6]$ sudo make
[hadoop@spark000 redis-6.0.6]$ sudo make install
[hadoop@spark000 redis-6.0.6]$ vi redis.conf
daemonize no ===> change to ===> daemonize yes
pidfile /var/run/redis_6379.pid // (the default port is 6379)
databases 16 // (16 logical databases are created by default)
- Start Redis
[hadoop@spark000 redis-6.0.6]$ src/redis-server redis.conf // (start the server)
[hadoop@spark000 redis-6.0.6]$ ps -ef|grep redis // (check that it is running)
[hadoop@spark000 redis-6.0.6]$ src/redis-cli // (connect with the CLI)
- Basic Redis commands
- hash: the type to master here (the streaming results are stored as hashes)
- a hash holds one or more field/value pairs under a single key
127.0.0.1:6379> keys *              // list all keys in the current db
(empty array)
127.0.0.1:6379> select 1            // switch to the second db
OK
127.0.0.1:6379[1]> select 0         // switch back to the first db
OK
127.0.0.1:6379> set user1 jieqiong  // set a key/value pair
OK
127.0.0.1:6379> set user2 jieqiong
OK
127.0.0.1:6379> set user3 jieqiong
OK
127.0.0.1:6379> get user1           // get the value of a key
"jieqiong"
127.0.0.1:6379> keys user[1-4]      // list keys matching a pattern
1) "user3"
2) "user2"
3) "user1"
127.0.0.1:6379> exists user         // 1 if the key exists, 0 if not
(integer) 0
127.0.0.1:6379> exists user1
(integer) 1
127.0.0.1:6379> exists user19
(integer) 0
127.0.0.1:6379> keys *              // list all keys in the current db
1) "user3"
2) "user2"
3) "user1"
127.0.0.1:6379> del user1 user2 user3   // delete multiple keys
(integer) 3
127.0.0.1:6379> keys *
(empty array)
127.0.0.1:6379> help                // built-in help
redis-cli 6.0.6
To get help about Redis commands type:
      "help @<group>" to get a list of commands in <group>
      "help <command>" for help on <command>
      "help <tab>" to get a list of possible help topics
      "quit" to exit
To set redis-cli preferences:
      ":set hints" enable online hints
      ":set nohints" disable online hints
Set your preferences in ~/.redisclirc
127.0.0.1:6379> help @string        // list all string commands
127.0.0.1:6379> help @hash          // list all hash commands
127.0.0.1:6379> hset user:100 name pk
(integer) 1
127.0.0.1:6379> hset user:100 age 30
(integer) 1
127.0.0.1:6379> hget user:100 name
"pk"
127.0.0.1:6379> hget user:100 age
"30"
127.0.0.1:6379> hmset user:100 age 29 gender male
OK
127.0.0.1:6379> hmget user:100 age gender
1) "29"
2) "male"
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "29"
5) "gender"
6) "male"
127.0.0.1:6379> hincrby user:100 age 2
(integer) 31
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "31"
5) "gender"
6) "male"
127.0.0.1:6379> hlen user:100
(integer) 3
127.0.0.1:6379> hvals user:100
1) "pk"
2) "31"
3) "male"
127.0.0.1:6379> hkeys user:100
1) "name"
2) "age"
3) "gender"
127.0.0.1:6379> hdel user:100 gender
(integer) 1
127.0.0.1:6379> hlen user:100
(integer) 2
127.0.0.1:6379> hvals user:100
1) "pk"
2) "31"
127.0.0.1:6379> hkeys user:100
1) "name"
2) "age"
127.0.0.1:6379> hgetall user:100
1) "name"
2) "pk"
3) "age"
4) "31"
127.0.0.1:6379>
Operating Redis through the Jedis API
- First, test connectivity only
- If connections time out (redis.clients.jedis.exceptions.JedisConnectionException: Failed connecting to host xxxxx:6379), see the troubleshooting post on cnblogs.com
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\RedisUtils.scala
- The result is that a key named stu is created in Redis
package com.imooc.spark.sss.project

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // wait when the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // max wait in milliseconds when exhausted
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow

  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = jedisPool.getResource

  def main(args: Array[String]): Unit = {
    getJedisClient.set("stu", "pk")
  }
}
127.0.0.1:6379> get stu
"pk"
- Test 2: store data via the API, check it in Redis
package com.imooc.spark.sss.project

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // wait when the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // max wait in milliseconds when exhausted
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow

  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = jedisPool.getResource

  def main(args: Array[String]): Unit = {
    getJedisClient.hset("imooc-user-100", "name", "pk")
    getJedisClient.hset("imooc-user-100", "age", "31")
    getJedisClient.hset("imooc-user-100", "gender", "m")
  }
}
127.0.0.1:6379> keys *
1) "user1"
2) "stu"
3) "imooc-user-100"
4) "user:100"
127.0.0.1:6379> hgetall imooc-user-100
1) "name"
2) "pk"
3) "age"
4) "31"
5) "gender"
6) "m"
- Test 3: read data via the API, check it in the IDEA console
package com.imooc.spark.sss.project

import java.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // wait when the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // max wait in milliseconds when exhausted
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow

  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = jedisPool.getResource

  def main(args: Array[String]): Unit = {
    import scala.collection.JavaConversions._
    val result: util.Map[String, String] = getJedisClient.hgetAll("imooc-user-100")
    for ((k, v) <- result) {
      println(k + "-->" + v)
    }
  }
}
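Running test 3 prints the fields written in test 2 to the IDEA console; expected output along these lines (hash iteration order is not guaranteed):

name-->pk
age-->31
gender-->m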
Sinking the Aggregated Results to Redis
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\SSSApp.scala
- Start, in order: dfs, yarn, log-web.jar, access-kafka.conf, zookeeper, kafka, consumer, master
- Tested OK; the data is successfully written to Redis
package com.imooc.spark.sss.project

import java.sql.Timestamp

import com.imooc.spark.sss.SourceApp.eventTimeWindow
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import redis.clients.jedis.Jedis

object SSSApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      // .config("spark.sql.shuffle.partitions","10")
      .appName(this.getClass.getName)
      .getOrCreate()

    // TODO... read the starting offsets from previously saved state
    import spark.implicits._

    spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark000:9092,spark000:9093,spark000:9094")
      .option("subscribe", "zhang-replicated-topic")
      //.option("startingOffsets", """{"zhang-replicated-topic":{"0":15000}}""")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String].map(x => {
        val splits = x.split("\t")
        val time = splits(0)
        val ip = splits(2)
        (new Timestamp(time.toLong), DateUtils.parseToDay(time), IPUtils.parseIP(ip))
      }).toDF("ts", "day", "province")
      .withWatermark("ts", "10 minutes")
      .groupBy("day", "province")
      .count()
      .writeStream
      //.format("console") // the console sink prints to the console ==> replaced by the Redis foreach sink below
      .outputMode(OutputMode.Update())
      .foreach(new ForeachWriter[Row] {
        var client: Jedis = _

        override def open(partitionId: Long, epochId: Long): Boolean = {
          client = RedisUtils.getJedisClient
          client != null
        }

        override def process(value: Row): Unit = {
          val day = value.getString(0)
          val province = value.getString(1)
          val cnts = value.getLong(2)
          // val offset = value.getAs[String]("offset")
          // client.set("","")
          // one hash per day: key = day-province-cnts-<day>, field = province, value = count
          client.hset("day-province-cnts-" + day, province, cnts + "")
        }

        override def close(errorOrNull: Throwable): Unit = {
          if (null != client) {
            RedisUtils.returnResource(client)
          }
        }
      })
      //.option("checkpointLocation","./chk")
      .start()
      .awaitTermination()
  }
}
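On the Redis side the results can then be inspected per day in redis-cli; a hypothetical check (the day suffix matches the sample timestamp above, and the provinces and counts are illustrative, not captured output):

127.0.0.1:6379> hgetall day-province-cnts-20200810
1) "北京"
2) "305"
3) "上海"
4) "120"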
- C:\Users\jieqiong\IdeaProjects\log-time\log-sss\src\main\scala\com\imooc\spark\sss\project\RedisUtils.scala
package com.imooc.spark.sss.project

import java.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtils {

  private val jedisPoolConfig = new JedisPoolConfig()
  jedisPoolConfig.setMaxTotal(100)            // maximum connections
  jedisPoolConfig.setMaxIdle(20)              // maximum idle connections
  jedisPoolConfig.setMinIdle(20)              // minimum idle connections
  jedisPoolConfig.setBlockWhenExhausted(true) // wait when the pool is exhausted
  jedisPoolConfig.setMaxWaitMillis(500)       // max wait in milliseconds when exhausted
  jedisPoolConfig.setTestOnBorrow(true)       // validate each connection on borrow

  private val jedisPool = new JedisPool(jedisPoolConfig, "spark000", 6379)

  def getJedisClient: Jedis = jedisPool.getResource

  def returnResource(jedis: Jedis): Unit = {
    jedisPool.returnResource(jedis)
  }

  def main(args: Array[String]): Unit = {
    import scala.collection.JavaConversions._
    val result: util.Map[String, String] = getJedisClient.hgetAll("imooc-user-100")
    for ((k, v) <- result) {
      println(k + "-->" + v)
    }
  }
}
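A note on the helper above: JedisPool.returnResource is deprecated in Jedis 2.9.x. A forward-compatible equivalent (a sketch, not the course code) simply closes the borrowed client, which hands it back to the pool:

def returnResource(jedis: Jedis): Unit = {
  if (jedis != null) jedis.close() // close() returns a pooled connection to its pool
}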
Running on the Server
- Place the required jars in the directory below:
- ip2region-1.7.2.jar, ip2region.db, jedis-2.9.3.jar, log-sss-1.0.jar
[hadoop@spark000 lib]$ pwd
/home/hadoop/lib
[hadoop@spark000 lib]$ ll
total 31984
-rw-rw-r-- 1 hadoop hadoop    16732 Apr  6 16:05 ip2region-1.7.2.jar
-rw-rw-r-- 1 hadoop hadoop  8397900 Apr  6 15:45 ip2region.db
-rw-rw-r-- 1 hadoop hadoop   563737 Apr  6 16:05 jedis-2.9.3.jar
-rw-rw-r-- 1 hadoop hadoop     8077 Mar 29 09:57 log-generator-1.0.jar
-rw-rw-r-- 1 hadoop hadoop  3836855 Apr  6 15:39 log-sss-1.0.jar
-rw-rw-r-- 1 hadoop hadoop 19834079 Mar 29 09:59 log-web-0.0.1-SNAPSHOT.jar
-rw------- 1 hadoop hadoop    78701 Apr  6 10:34 nohup.out
- Submit script
- Runs successfully, but the result differs from the local test (not surprising when the two runs consume different Kafka offset ranges)
spark-submit \
--master yarn \
--name SSSApp \
--class com.imooc.spark.sss.project.SSSApp \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 \
--jars /home/hadoop/lib/jedis-2.9.3.jar,/home/hadoop/lib/ip2region-1.7.2.jar \
--files /home/hadoop/lib/ip2region.db \
/home/hadoop/lib/log-sss-1.0.jar