|NO.Z.00005|——————————|BigDataEnd|——|Hadoop&OLAP_Druid.V05|——|Druid.v05|Getting Started|Loading Streaming Data from Kafka.V1|
I. Loading Streaming Data from Kafka
### --- Loading streaming data from Kafka
~~~ Data and requirements. In a typical Druid architecture, complex data transformation and cleansing work is not done inside Druid

### --- Assume the following network traffic data:
~~~ ts: timestamp
~~~ srcip: source IP address
~~~ srcport: source port number
~~~ dstip: destination IP address
~~~ dstPort: destination port number
~~~ protocol: protocol
~~~ packets: number of packets transferred
~~~ bytes: number of bytes transferred
~~~ cost: time taken by the transfer
~~~ # The data is in JSON format and arrives through Kafka
~~~ Each record contains: a timestamp (ts), dimension columns, and metric columns
~~~ Dimension columns: srcip, srcport, dstip, dstPort, protocol
~~~ Metric columns: packets, bytes, cost
~~~ # Metrics to compute:
~~~ number of records: count
~~~ packets: max
~~~ bytes: min
~~~ cost: sum
~~~ # Rollup granularity: minute (see the metricsSpec sketch below)
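~~~ # As a reference, a minimal sketch of how these requirements could map to a Druid metricsSpec.
~~~ # The aggregator types follow the stated requirements; the metric names sum_packets / min_bytes
~~~ # match the SQL used later in this walkthrough, and sum_cost is an assumed name:
"metricsSpec": [
  { "type": "count",     "name": "count" },
  { "type": "longMax",   "name": "sum_packets", "fieldName": "packets" },
  { "type": "longMin",   "name": "min_bytes",   "fieldName": "bytes" },
  { "type": "doubleSum", "name": "sum_cost",    "fieldName": "cost" }
]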
II. Preparing Test Data
### --- Test data:
{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:01:36Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":2, "bytes":2000, "cost": 0.1}
{"ts":"2021-10-01T00:01:37Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":3, "bytes":3000, "cost": 0.1}
{"ts":"2021-10-01T00:01:38Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":4, "bytes":4000, "cost": 0.1}
{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
{"ts":"2021-10-01T00:02:09Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":6, "bytes":6000, "cost": 0.2}
{"ts":"2021-10-01T00:02:10Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":7, "bytes":7000, "cost": 0.2}
{"ts":"2021-10-01T00:02:11Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":8, "bytes":8000, "cost": 0.2}
{"ts":"2021-10-01T00:02:12Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":9000, "cost": 0.2}
### --- Finally, run queries:
~~~ # Query the data
select * from tab;
~~~ # Expected output: two rows are returned; the values below are only an illustration
{"ts":"2021-10-01T00:01","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":4, "bytes":1000, "cost": 0.4, "count":4}
{"ts":"2021-10-01T00:02","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":5000, "cost": 1.0, "count":5}
~~~ # Other queries ("count" is quoted because an unquoted count is parsed as a function)
select dstPort, min(packets), max(bytes), sum("count"), min("count")
from tab
group by dstPort;
III. Creating a Topic and Sending Messages
### --- Start the Kafka cluster and create a topic named "yanqidruid1":
~~~ # Start Kafka
[root@hadoop01 ~]# kafka-server-start.sh -daemon /opt/yanqi/servers/kafka_2.12/config/server.properties
### --- Create the topic
~~~ # Create the topic
~~~ # In --zookeeper hadoop01:2181,hadoop02:2181/myKafka, the /myKafka suffix is a chroot namespace; check whether your installation uses one
[root@hadoop01 ~]# kafka-topics.sh --create --zookeeper hadoop01:2181,hadoop02:2181/myKafka --replication-factor 2 --partitions 6 --topic yanqidruid1
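~~~ # (Optional) Verify the topic was created; --describe is a standard kafka-topics.sh flag
[root@hadoop01 ~]# kafka-topics.sh --describe --zookeeper hadoop01:2181,hadoop02:2181/myKafka --topic yanqidruid1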
### --- Start a producer
~~~ # Start a console producer
[root@hadoop01 ~]# kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092 --topic yanqidruid1
~~~ # Send the sample records below
{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
IV. Ingesting Data from Kafka
### --- Ingest data from Kafka
~~~ # Open hadoop03:8888 in a browser and click Load data in the console
~~~ # Start: select Apache Kafka, then click Connect data

### --- Connect
~~~ Enter hadoop01:9092,hadoop02:9092 under Bootstrap servers
~~~ Enter yanqidruid1 under Topic
~~~ Click Preview and make sure the data shown is correct
~~~ Then click "Next: Parse data" to move on to the next step

### --- Parse data
~~~ The data loader tries to determine the correct parser for the data automatically; several parsers are available
~~~ Here the json parser is used (see the inputFormat sketch below)
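~~~ # In the generated spec this choice corresponds to the following inputFormat fragment (a sketch; exact fields can vary by Druid version):
"inputFormat": { "type": "json" }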

### --- Parse time
~~~ Define the primary time column of the data (see the timestampSpec sketch below)
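~~~ # With ts in ISO-8601 form, the resulting timestampSpec fragment would look like this sketch:
"timestampSpec": { "column": "ts", "format": "iso" }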

### --- Transform
~~~ Complex data transformations are not recommended inside Druid; consider handling them in a preprocessing step
~~~ No transforms are defined here

### --- Filter
~~~ Complex data filtering is likewise not recommended inside Druid; consider handling it in a preprocessing step
~~~ No filters are defined here

### --- Configure schema
~~~ Define the metric and dimension columns
~~~ Define how the metrics are computed over those dimensions
~~~ Define whether data is combined at ingestion time (i.e. Rollup) and the Rollup granularity (see the granularitySpec sketch below)
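~~~ # A sketch of the corresponding granularitySpec fragment; segmentGranularity is chosen on the Partition step below:
"granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "MINUTE", "rollup": true }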

### --- Partition
~~~ Define how the data is partitioned
~~~ # Primary partitioning has two modes
~~~ uniform: buckets data by a fixed time interval; this mode is recommended. Here one day of data forms one partition
~~~ arbitrary: tries to keep segments roughly the same size, with no fixed time interval
~~~ # Secondary partitioning
~~~ Max rows per segment: the maximum number of rows in each Segment
~~~ Max total rows: the maximum number of rows across Segments awaiting publication (both map to the tuningConfig sketch below)
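~~~ # Both limits live in the tuningConfig; a sketch using Druid's usual defaults:
"tuningConfig": { "type": "kafka", "maxRowsPerSegment": 5000000, "maxTotalRows": 20000000 }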

### --- Tune
~~~ Define parameters for task execution and optimization

### --- Publish
~~~ Define the Datasource name (this walkthrough uses yanqitable1, which the queries below reference)
~~~ Define the action to take when a record fails to parse

### --- Edit spec
~~~ The JSON shown here is the ingestion spec. You can go back to earlier steps to make changes,
~~~ or edit the spec directly and see the changes reflected in the preceding steps
~~~ Once the ingestion spec is complete, click Submit to create the ingestion task (a full sketch of the spec follows)
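~~~ # For reference, a minimal sketch of the complete Kafka supervisor spec assembled by the steps above.
~~~ # The metric names and the sum_cost aggregator are assumptions consistent with the SQL in the next
~~~ # section; exact fields may differ slightly across Druid versions:
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "hadoop01:9092,hadoop02:9092" },
      "topic": "yanqidruid1",
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": 20000000
    },
    "dataSchema": {
      "dataSource": "yanqitable1",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["srcip", "srcport", "dstip", "dstPort", "protocol"]
      },
      "metricsSpec": [
        { "type": "count",     "name": "count" },
        { "type": "longMax",   "name": "sum_packets", "fieldName": "packets" },
        { "type": "longMin",   "name": "min_bytes",   "fieldName": "bytes" },
        { "type": "doubleSum", "name": "sum_cost",    "fieldName": "cost" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    }
  }
}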

V. Querying the Data
### --- Querying the data
~~~ After the ingestion spec is submitted, a Supervisor is created
~~~ The Supervisor starts a Task that ingests data from Kafka
~~~ After a short wait the Datasource is created, and the data can then be queried


### --- Write data into Kafka
~~~ # Start the producer
[root@hadoop01 ~]# kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092 --topic yanqidruid1
~~~ # Send the test records
{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:01:36Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":2, "bytes":2000, "cost": 0.1}
{"ts":"2021-10-01T00:01:37Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":3, "bytes":3000, "cost": 0.1}
{"ts":"2021-10-01T00:01:38Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":4, "bytes":4000, "cost": 0.1}
{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
{"ts":"2021-10-01T00:02:09Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":6, "bytes":6000, "cost": 0.2}
{"ts":"2021-10-01T00:02:10Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":7, "bytes":7000, "cost": 0.2}
{"ts":"2021-10-01T00:02:11Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":8, "bytes":8000, "cost": 0.2}
{"ts":"2021-10-01T00:02:12Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":9000, "cost": 0.2}
### --- Query the data
~~~ # View all the data -- note: rows with identical dimensions have been rolled up
select * from "yanqitable1"
~~~ # Other queries --- the count field is quoted to mark it as a column name (quoting escapes it; otherwise count is parsed as a function and the query fails)
select dstPort, min(sum_packets), max(min_bytes), sum("count"), min("count") from "yanqitable1" group by dstPort
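~~~ # The same query can also be issued against Druid's SQL HTTP API through the router (hadoop03:8888 above); a sketch:
[root@hadoop01 ~]# curl -X POST -H 'Content-Type: application/json' http://hadoop03:8888/druid/v2/sql -d '{"query": "select dstPort, min(sum_packets), max(min_bytes), sum(\"count\"), min(\"count\") from \"yanqitable1\" group by dstPort"}'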
