Installation and Usage of Competition Components
Kafka
1. Installation and shell operations
STEP 1: GET KAFKA
tar -zxvf kafka_2.12-2.4.1.tgz -C ../export/servers/
cd kafka_2.12-2.4.1/
STEP 2: START THE KAFKA ENVIRONMENT
# Start the ZooKeeper service
$ bin/zookeeper-server-start.sh config/zookeeper.properties
# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties
STEP 3: CREATE A TOPIC TO STORE YOUR EVENTS
Note: "events" here simply means "messages".
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
STEP 4: WRITE SOME EVENTS INTO THE TOPIC
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
# Error: "bootstrap-server is not a recognized option"
# Versions before 2.5 use "--broker-list"; 2.5 and later use "--bootstrap-server".
# This install is 2.4.1, so use:
$ bin/kafka-console-producer.sh --topic quickstart-events --broker-list localhost:9092
STEP 5: READ THE EVENTS
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
STEP 6: TERMINATE THE KAFKA ENVIRONMENT
Stop the producer and consumer clients with Ctrl-C, if you haven't done so already.
Stop the Kafka broker with Ctrl-C.
Lastly, if the Kafka with ZooKeeper section was followed, stop the ZooKeeper server with Ctrl-C.
rm -rf /tmp/kafka-logs /tmp/zookeeper
2. Reading and writing Kafka data with Spark
Since Spark 1.3, KafkaUtils has provided two ways to create a DStream: KafkaUtils.createDstream and KafkaUtils.createDirectStream. The latter requires the Kafka 0.10 Spark Streaming integration, i.e. spark-streaming-kafka-0-10_2.12.
- Server side
OS: CentOS 7
IP: 10.10.0.100
Kafka: 2.4.1
(1) Start the ZooKeeper service
(2) Start the Kafka service
- Development side
OS: Ubuntu 18
IDE: IDEA 2022
Framework: Spark 3.1.1
Language: Scala 2.12
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>demo3</artifactId>
<version>1.0-SNAPSHOT</version>
<name>Archetype - demo3</name>
<url>http://maven.apache.org</url>
<properties>
<spark-version>3.1.1</spark-version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark-version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark-version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>${spark-version}</version>
</dependency>
</dependencies>
</project>
2.1 Reading data from Kafka
2.1.1 Producing data
Produce data with the Kafka console tools.
(1) Create the topic WordCount
bin/kafka-topics.sh --create --topic WordCount --bootstrap-server localhost:9092
(2) Start a console producer to generate data for the WordCount topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic WordCount
# Note: on 2.4.1 the console producer takes --broker-list, not --bootstrap-server (see Step 4 above).
2.1.2 Consuming data
Consume the data with a Spark Streaming program.
Spark02_ReadFromKafka.scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Spark02_ReadFromKafka {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) //display error only
val conf = new SparkConf().setAppName("StreamWordCount").setMaster("local[*]")
val sc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "10.10.0.100:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1"
)
val topics = Array("WordCount")
val stream = KafkaUtils.createDirectStream[String, String](
sc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val kafkaStream = stream.map(record => (record.key, record.value))
val lines = kafkaStream.map(_._2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map { x => (x, 1) }
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
sc.start()
sc.awaitTermination()
}
}
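The flatMap/map/reduceByKey chain in the consumer above is a plain word count. As a quick cross-check of the expected output, the same aggregation in ordinary Python (no Spark or Kafka involved):

```python
from collections import Counter

def word_count(lines):
    # mirrors flatMap(_.split(" ")) -> map((_, 1)) -> reduceByKey(_ + _)
    return Counter(word for line in lines for word in line.split(" ") if word)

print(word_count(["hello spark", "hello kafka"]))
# Counter({'hello': 2, 'spark': 1, 'kafka': 1})
```

Typing those two lines into the console producer should make the streaming job print the same counts in its next batch.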
2.2 Writing data to Kafka
2.2.1 Producing data
Produce the data with a Spark program.
Create a data directory and file under the project root, data/sample.log, and put a few lines of text in it.
Spark03_WriteToKafka.scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import java.util.Properties
object Spark03_WriteToKafka {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf().setAppName("write-to-kafka").setMaster("local[*]")
val sc = new SparkContext(conf)
// read the input file
val lines: RDD[String] = sc.textFile("data/sample.log")
// define the Kafka producer parameters
val prop = new Properties()
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.10.0.100:9092")
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
// send the lines we read to topic mytopic1
lines.foreachPartition { iter =>
// create one KafkaProducer per partition
val producer = new KafkaProducer[String, String](prop)
iter.foreach { line =>
val record = new ProducerRecord[String, String]("mytopic1", line)
producer.send(record)
}
producer.close()
}
}
}
2.2.2 Consuming data
Open a Kafka console consumer to consume the data:
bin/kafka-console-consumer.sh --topic mytopic1 --from-beginning --bootstrap-server localhost:9092
MySQL
Official site: https://www.mysql.com/
Goal: install MySQL 5.7 on CentOS 7. How?
- Option 1: yum
Run yum install mysql-community-server directly, but this installs the latest 8.x.
- Option 2: rpm
Download the MySQL 5.7 rpm packages and install them with rpm -ivh xxx.rpm, but you must resolve the dependencies between packages yourself.
- Option 3: tar
Download the MySQL 5.7 tarball and unpack it with tar, but you must configure MySQL yourself.
Final approach: a custom yum repository
1. Download the yum repository package
wget http://dev.mysql.com/get/mysql57-community-release-el7-8.noarch.rpm
Note: if wget is not installed, the rpm can also be downloaded with a browser.
2. Install the MySQL repository
yum localinstall mysql57-community-release-el7-8.noarch.rpm
yum repolist enabled
# check that the repo is enabled
3. Edit the mysql-community.repo file to change which MySQL version installs by default
Set enabled=1 for the version you want to install.
Note: in this example 5.7 already has enabled=1 by default, so this step can be skipped.
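For reference, the enabled flag lives in a stanza of /etc/yum.repos.d/mysql-community.repo that looks roughly like this (a sketch; baseurl and gpg settings vary by release):

```ini
[mysql57-community]
name=MySQL 5.7 Community Server
baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/
enabled=1
gpgcheck=1
```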
4. Install MySQL
yum -y install mysql-community-server --nogpgcheck
5. Start the MySQL service
systemctl start mysqld
systemctl status mysqld
systemctl enable mysqld
# start on boot
systemctl daemon-reload
6. Change the root password for local login
(1) Get the temporary password
grep 'temporary password' /var/log/mysqld.log
(2) Log in with the temporary password
mysql -uroot -p
***
(3) MySQL password policy:
Lower the password strength requirement:
set global validate_password_policy=0;
Lower the minimum password length:
set global validate_password_length=4;
(4) Change the root user's password
alter user 'root'@'localhost' identified by 'root';
7. Set the default character set to utf8
vi /etc/my.cnf
[mysqld]
character_set_server=utf8
init_connect='SET NAMES utf8'
Default file locations:
Config file: /etc/my.cnf
Log file: /var/log/mysqld.log
Service unit: /usr/lib/systemd/system/mysqld.service
PID file: /var/run/mysqld/mysqld.pid
Maxwell
Official site: https://maxwells-daemon.io/
Maxwell is a daemon that captures change data from MySQL's binlog and writes it as JSON to streaming platforms such as Kafka.
Installation
OS: CentOS 7
maxwell-1.29.2.tar.gz
tar -zxvf ~/package/maxwell-1.29.2.tar.gz -C ~/export/servers/
vi /etc/profile
export MAXWELL_HOME=/root/export/servers/maxwell-1.29.2
export PATH=$PATH:$MAXWELL_HOME/bin
source /etc/profile
Configure MySQL
vi /etc/my.cnf
# /etc/my.cnf
[mysqld]
# maxwell needs binlog_format=row
binlog_format=row
# unique id of this MySQL server
server_id=1
# enable the binlog; "master" is the prefix for binlog file names
log-bin=master
Configure Maxwell
Note: this can be skipped, but then the options must be passed on the command line at startup.
cd $MAXWELL_HOME
cp config.properties.example config.properties
vi config.properties
############ add the following ############
# producer type
producer=kafka
# target Kafka brokers
kafka.bootstrap.servers=localhost:9092
# target Kafka topic
kafka_topic=maxwell
# MySQL connection info
host=localhost
user=root
password=root
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true
# partition records by primary key so data spreads evenly across Kafka partitions
producer_partition_by=primary_key
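producer_partition_by=primary_key means every change for a given row lands in the same Kafka partition, which preserves per-row ordering. Conceptually the mapping is a stable hash modulo the partition count; this Python sketch only illustrates the idea (it is not Maxwell's actual hash function):

```python
import hashlib

def partition_for(primary_key: str, num_partitions: int) -> int:
    # stable hash: the same key always maps to the same partition
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# all changes to row id=1 of test_db.person go to one partition
print(partition_for("test_db.person:1", 4))
```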
Startup
1. Start ZooKeeper
In a new terminal:
cd $KAFKA_HOME
bin/zookeeper-server-start.sh config/zookeeper.properties
2. Start Kafka
In a new terminal:
cd $KAFKA_HOME
bin/kafka-server-start.sh config/server.properties
3. Start Maxwell
In a new terminal:
- Option 1: options in the config file
bin/maxwell --config config.properties --daemon
- Option 2: options on the command line
bin/maxwell --user=root --password=root --host=localhost --producer=kafka --kafka.bootstrap.servers=localhost:9092 --kafka_topic=maxwell
# maxwell captures changed data from mysql and produces it to the kafka topic "maxwell"
4. Start a Kafka console consumer
In a new terminal:
cd $KAFKA_HOME
bin/kafka-console-consumer.sh --topic maxwell --from-beginning --bootstrap-server localhost:9092
5. Change data in MySQL
In a new terminal, insert, delete, or update rows in any database.
6. Watch the consumer
Sample output:
{"database":"test_db","table":"person","type":"insert","ts":1702388752,"xid":1339,"commit":true,"data":{"id":1,"name":"zerg"}}
{"database":"test_db","table":"person","type":"insert","ts":1702389034,"xid":2043,"commit":true,"data":{"id":2,"name":"Terren"}}
{"database":"test_db","table":"person","type":"insert","ts":1702389692,"xid":3365,"commit":true,"data":{"id":3,"name":"Protoss"}}
{"database":"test_db","table":"person","type":"update","ts":1702389795,"xid":3628,"commit":true,"data":{"id":1,"name":"Zerg"},"old":{"name":"zerg"}}
{"database":"test_db","table":"person","type":"delete","ts":1702390228,"xid":4713,"commit":true,"data":{"id":1,"name":"Zerg"}}
Redis
Official site: https://redis.io/
OS: CentOS 7
Package: ~/package/redis-6.2.6.tar.gz # source package, not yet compiled
Installation
1. Unpack
cd ~
tar -zxvf ~/package/redis-6.2.6.tar.gz -C ~/export/servers/
2. Compile
cd /root/export/servers/redis-6.2.6
# enter the redis directory
make
# compile
Error: cc: command not found
Cause: CentOS has no C compiler installed.
Fix:
yum -y install gcc automake autoconf libtool make
make
# compile again
Error: zmalloc.h:50:31: fatal error: jemalloc/jemalloc.h: No such file or directory
Fix:
Delete the unpacked redis directory, then re-extract and compile again until it succeeds (the error is left over from the earlier failed build).
3. Install
make install
# installs the binaries into /usr/local/bin
4. Test
- Start the redis server
redis-server
Sample output:
_._
_.-``__ ''-._
_.-`` `. `_. ''-._ Redis 6.2.6 (00000000/0) 64 bit
.-`` .-```. ```\/ _.,_ ''-._
( ' , .-` | `, ) Running in standalone mode
|`-._`-...-` __...-.``-._|'` _.-'| Port: 6379
| `-._ `._ / _.-' | PID: 6348
`-._ `-._ `-./ _.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' | https://redis.io
`-._ `-._`-.__.-'_.-' _.-'
|`-._`-._ `-.__.-' _.-'_.-'|
| `-._`-._ _.-'_.-' |
`-._ `-._`-.__.-'_.-' _.-'
`-._ `-.__.-' _.-'
`-._ _.-'
`-.__.-'
- Stop the redis server
redis-cli shutdown
Programming
- Server side: CentOS (redis installed)
- Development side: Ubuntu (IDEA)
(1) Adjust the Redis configuration
cd ~/export/servers/redis-6.2.6
# enter the redis directory
vi redis.conf
# disable protected mode
protected-mode no # default is yes
# remove the local-address binding to allow external access
#bind 127.0.0.1 -::1 # comment out this line
(2) Write a Java program
a. Add the Jedis dependency to pom.xml
<!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
b. Create the Java program
Redis_Test.java
import redis.clients.jedis.Jedis;
public class Redis_Test {
public static void main(String[] args) {
Jedis jedis = new Jedis("10.10.0.100",6379);
jedis.set("myname","mike");
System.out.println(jedis.get("myname"));
}
}
Output:
mike
c. Access from the redis command-line client
redis-cli
127.0.0.1:6379> get myname
"mike"
ClickHouse
Official site: https://clickhouse.com/
ClickHouse® is a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP). It is available both as open-source software and as a cloud service.
1. Installation
OS: CentOS 7
Version: ClickHouse 21.9.4
Download the packages:
clickhouse-server-21.9.4.35-2.noarch.rpm 113K
clickhouse-common-static-21.9.4.35-2.x86_64.rpm 183M
clickhouse-client-21.9.4.35-2.noarch.rpm 90K
(1) Install the static package
rpm -ivh clickhouse-common-static-21.9.4.35-2.x86_64.rpm
(2) Install the server
yum localinstall clickhouse-server-21.9.4.35-2.noarch.rpm
Password for default user is empty string. See /etc/clickhouse-server/users.xml and /etc/clickhouse-server/users.d to change it.
(3) Install the client
yum localinstall clickhouse-client-21.9.4.35-2.noarch.rpm
(4) Start the service
systemctl enable clickhouse-server
# start on boot
systemctl start clickhouse-server
# start the service
systemctl status clickhouse-server
# check the service status
(5) Connect with the client
clickhouse-client
Sample output:
ClickHouse client version 21.9.4.35 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.4 revision 54449.
hadoop01 :)
2. Usage
Command-line client:
clickhouse-client -m
# allow multi-line statements
Official demo:
(1) Create a table
CREATE TABLE my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)
(2) Insert data
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )
(3) Query
SELECT * FROM my_first_table ORDER BY timestamp
Programmatic client
(1) Prepare data
Database: starcraft
Table: race
Data:
┌─id─┬─name────┐
│ 1 │ Zerg │
│ 2 │ Terren │
│ 3 │ Protoss │
└────┴─────────┘
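The race table must exist before the Spark read below can work. One way to create and populate it (a sketch; the original DDL is not shown, so the MergeTree engine and column types are assumptions consistent with the data above):

```sql
CREATE DATABASE IF NOT EXISTS starcraft;

CREATE TABLE starcraft.race
(
    id   UInt32,
    name String
)
ENGINE = MergeTree
PRIMARY KEY id;

INSERT INTO starcraft.race (id, name) VALUES
    (1, 'Zerg'), (2, 'Terren'), (3, 'Protoss');
```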
(2) Configure dependencies
pom.xml
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.1</version>
<!-- <scope>provided</scope>-->
<!-- "provided" means the dependency is available at compile time but not bundled into the jar (the cluster already ships it), which keeps the package small. When running locally the dependency is still needed, so remove or comment out provided for local runs. -->
</dependency>
<!-- https://mvnrepository.com/artifact/com.clickhouse/clickhouse-jdbc -->
<dependency>
<groupId>com.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>0.4.0</version>
</dependency>
(3) Reading ClickHouse data from Spark
ClickHouse_Read.scala
package cn.beast.study
import org.apache.spark.sql.SparkSession
object ClickHouse_Read {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("ReadTest").master("local[*]").getOrCreate()
spark.read.format("jdbc")
.option("driver", "com.clickhouse.jdbc.ClickHouseDriver") // clickhouse 连接驱动
.option("url", "jdbc:clickhouse://10.10.0.100:8123/starcraft") // 指定 clickhouse 连接库
.option("user", "") // 用户名
.option("password", "") // 密码
.option("dbtable", "race") // 连接表
.load()
.orderBy("id")
.show()
spark.stop()
}
}
# If the connection is refused, edit "/etc/clickhouse-server/config.xml":
# uncomment "<!-- <listen_host>::</listen_host> -->" so the server listens on both IPv4 and IPv6 addresses
(4) Writing data to ClickHouse from Spark
ClickHouse_Write.scala
package cn.beast.study
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
object ClickHouse_Write {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("WriteTest").master("local[*]").getOrCreate()
spark.read
.text("data/race.txt")
.createOrReplaceTempView("data")
val dataFrame: DataFrame = spark.sql(
"""
|select
| split(value," ")[0] id,
| split(value," ")[1] name
|from data
|""".stripMargin)
dataFrame.write.format("jdbc")
.option("driver", "com.clickhouse.jdbc.ClickHouseDriver") // clickhouse 连接驱动
.option("url", "jdbc:clickhouse://10.10.0.100:8123/starcraft") // 指定 clickhouse 连接库
.option("user", "") // 用户名
.option("password", "") // 密码
.option("dbtable", "race") // 连接表,该表必须存在!否则会报错
.mode(SaveMode.Append) // 以追加的模式写入数据
.save()
spark.stop()
}
}
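The temp-view SQL above splits each line of data/race.txt on a single space into id and name. The equivalent parsing in plain Python (the "id name" line format is inferred from that SQL):

```python
def parse_race_line(line: str) -> dict:
    # split(value, " ")[0] -> id, split(value, " ")[1] -> name
    race_id, name = line.rstrip("\n").split(" ", 1)
    return {"id": race_id, "name": name}

print(parse_race_line("1 Zerg"))
# {'id': '1', 'name': 'Zerg'}
```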
Scala
Official site: https://www.scala-lang.org/
1. Installation
OS: Ubuntu 18
Scala version: 2.12.18
- Download
wget https://www.scala-lang.org/files/archive/scala-2.12.18.deb
- Install
sudo dpkg -i scala-2.12.18.deb
dpkg -L scala
# find the Scala install path
2. Usage
Hive
1. Installation
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /software/
2. Configuration
(1) Environment variables
vi /etc/bashrc
export HIVE_HOME=/software/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
source /etc/bashrc
(2) Configure hive-site.xml
The external metastore mode needs MySQL; the embedded (derby) mode needs no configuration.
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
:set nu # show line numbers
:3,$d # delete from line 3 through the last line
Then add the following properties (inside the <configuration> element):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://10.10.0.100:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
3. Copy the MySQL driver
docker cp mysql-connector-java-5.1.39-bin.jar hadoop1:/software/apache-hive-3.1.2-bin/lib
4. Initialize the metastore
./bin/schematool -dbType mysql -initSchema
Error 1:
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.base.Preconditions.checkArgument
Cause:
The guava.jar bundled with Hive and the one shipped with Hadoop have different versions.
Fix:
1. Check the version of guava.jar under share/hadoop/common/lib in the Hadoop install directory.
2. Check the version of guava.jar under lib in the Hive install directory; if the two differ, replace the lower version with the higher one.
Then run the initialization again.
5. Start Hive
(1) Start Hadoop
start-all.sh
(2) Start Hive
hive
Hive Session ID = 9e311256-b2af-4c85-a9b7-23fc38d1baef
hive> show databases;
OK
default
Time taken: 1.47 seconds, Fetched: 1 row(s)
hive> create database test_db1;
OK
Time taken: 0.459 seconds
Zookeeper
1. Installation
tar -zxvf apache-zookeeper-3.5.5-bin.tar.gz -C /software/
2. Configuration
(1) Environment variables
export ZOOKEEPER_HOME=/software/apache-zookeeper-3.5.5-bin
export PATH=$PATH:$ZOOKEEPER_HOME/bin
(2) Configure zoo.cfg
# first create the directories
cd /software/apache-zookeeper-3.5.5-bin/
mkdir data log
# copy zoo_sample.cfg:
cd conf
cp zoo_sample.cfg zoo.cfg
vim zoo.cfg
dataDir=/software/apache-zookeeper-3.5.5-bin/data
dataLogDir=/software/apache-zookeeper-3.5.5-bin/log
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
(3) Create the myid file
In the directory that dataDir points to, create myid and write this server's number into it:
cd /software/apache-zookeeper-3.5.5-bin/data
echo '1' >myid
(4) Copy zookeeper-3.5.5 to hadoop2 and hadoop3
scp -r /software/apache-zookeeper-3.5.5-bin hadoop2:/software/
# change myid under hadoop2's data directory to 2
vi /software/apache-zookeeper-3.5.5-bin/data/myid
2
scp -r /software/apache-zookeeper-3.5.5-bin hadoop3:/software/
# change myid under hadoop3's data directory to 3
vi /software/apache-zookeeper-3.5.5-bin/data/myid
3
3. Start
# start zkServer on each of the three machines
zkServer.sh start
# check the process
jps
3048 QuorumPeerMain
# check the status
zkServer.sh status
# reports follower or leader
HBase
1. Installation
tar -zxvf hbase-2.2.3-bin.tar.gz -C /software/
2. Configuration
(1) Environment variables
export HBASE_HOME=/software/hbase-2.2.3
export PATH=$PATH:$HBASE_HOME/bin
(2) Configure hbase-env.sh
cd $HBASE_HOME/conf # enter the config directory
vi hbase-env.sh
export JAVA_HOME=/software/jdk1.8.0_181
export HBASE_MANAGES_ZK=false
(3) Configure hbase-site.xml
vi hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop1:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop1,hadoop2,hadoop3</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/software/apache-zookeeper-3.5.5-bin/data</value>
</property>
</configuration>
(4) Configure regionservers
vi regionservers
hadoop2
hadoop3
# Note: the regionservers file lists the hosts that should run an HRegionServer
(5) Copy hbase-2.2.3 to hadoop2 and hadoop3
scp -r /software/hbase-2.2.3 hadoop2:/software/
scp -r /software/hbase-2.2.3 hadoop3:/software/
3. Start
# start the hadoop cluster
start-all.sh
# start zookeeper; note: on all three machines
zkServer.sh start
# start hbase
start-hbase.sh
# check the processes
jps
# processes on hadoop1
2225 NameNode
2564 ResourceManager
4245 Jps
4070 HMaster (added by hbase)
3048 QuorumPeerMain (added by zookeeper)
2413 SecondaryNameNode
# processes on hadoop2
2057 DataNode
2749 HRegionServer (added by hbase)
2367 QuorumPeerMain (added by zookeeper)
2895 Jps
# web UI
http://10.10.0.100:16010
# enter the hbase shell
hbase shell
# leave the hbase shell
exit
# stop hbase
stop-hbase.sh
Spark
- standalone mode
Installation
(omitted)
Configuration
- Environment variables
# standalone
export SPARK_HOME=/software/spark-3.1.1-bin-hadoop3.2
- workers
hadoop2
hadoop3
Copy to the other nodes:
scp -r /software/spark-3.1.1-bin-hadoop3.2 hadoop2:/software/
scp -r /software/spark-3.1.1-bin-hadoop3.2 hadoop3:/software/
Start
cd $SPARK_HOME/
sbin/start-all.sh
Processes on hadoop1:
216 Jps
61 Master
Processes on hadoop2:
326 Jps
55 Worker
spark-shell
bin/spark-shell --master spark://hadoop1:7077
Spark context Web UI available at http://a16e70f1373e:4040
Spark context available as 'sc' (master = spark://hadoop1:7077, app id = app-20240111081519-0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala> print("hello,spark!")
hello,spark!
spark-submit
Official example: estimating Pi
bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://hadoop1:7077 examples/jars/spark-examples_2.12-3.1.1.jar
24/01/11 08:10:19 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 23.461299 s
Pi is roughly 3.1472157360786803
webUI
http://10.10.0.100:8080
Application ID: app-20240111080947-0000 | Name: Spark Pi | Cores: 2 | Memory per Executor: 1024.0 MiB | Submitted Time: 2024/01/11 08:09:47 | User: root | State: FINISHED | Duration: 33 s
- yarn (mixed) mode
No need to start Spark itself.
Start Hadoop, then:
$>spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.12-3.1.1.jar
Flink
- local mode
- Unpack
tar -xzvf flink-1.14.0-bin-scala_2.12.tgz
cd flink-1.14.0
- Start
./bin/start-cluster.sh
- Submit a job
# an example program shipped with flink
./bin/flink run examples/streaming/WordCount.jar
tail log/flink-root-taskexecutor-0-hadoop01.out
# (nymph,1)
# (in,3)
...
- Stop
./bin/stop-cluster.sh
- standalone mode
- Unpack
tar -zxvf flink-1.14.0-bin-scala_2.12.tgz -C /software/
- Configure
- conf/flink-conf.yaml
jobmanager.rpc.address: hadoop1
- conf/workers
hadoop2
hadoop3
- Copy
scp -r flink-1.14.0/ hadoop2:/software/
scp -r flink-1.14.0/ hadoop3:/software/
- Start
bin/start-cluster.sh
jps
Sample output:
# hadoop1
633 Jps
61 Master
590 StandaloneSessionClusterEntrypoint
# hadoop2
678 Jps
55 Worker
603 TaskManagerRunner
- Test
# an example program shipped with flink
./bin/flink run examples/streaming/WordCount.jar
# the results are on a taskmanager node (hadoop2 or hadoop3)
vi log/flink-root-taskexecutor-0-a16e70f1373e.out
- yarn mode
- Starting a Flink Session on YARN
- Set the hadoop classpath
export HADOOP_CLASSPATH=`hadoop classpath`
- Start a YARN session
./bin/yarn-session.sh --detached
- Submit an example job
./bin/flink run ./examples/streaming/TopSpeedWindowing.jar
- Stop the YARN session
echo "stop" | ./bin/yarn-session.sh -id application_1705020676638_0001
Flume
- demo
- Write the configuration file
conf/example.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start the flume agent
bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1
- Send events (the netcat source is the listener, so connect to it and type lines)
nc localhost 44444
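Instead of nc, a test event can be pushed into the netcat source from a few lines of Python; host and port here match the example.conf above:

```python
import socket

def send_event(line: str, host: str = "localhost", port: int = 44444) -> None:
    # the flume netcat source listens on host:port; connect and send one line
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("utf-8") + b"\n")
```

With the agent running, send_event("hello flume") should then show up on the logger sink's console output.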
- demo2
This task is taken from the e-commerce problem (real-time data collection) of the national vocational skills competition, big data track.
Task
1. On the Master node, use Flume to collect the socket data that the real-time data generator emits on port 25001 (the generator script is /data_log/gen_ds_data_to_socket on the Master node; it is deployed locally and transmits over a socket). Write the data into a Kafka topic named ods_mall_log with 4 partitions, then consume ods_mall_log with Kafka's built-in console consumer and inspect the first 2 records.
Note: the pre-configured Flume agent must be started before the script, or the script will fail to start. To run the script, enter /data_log and execute ./gen_ds_data_to_socket (if permission is denied, first run chmod 777 /data_log/gen_ds_data_to_socket).
- Create the topic
Start the Kafka service first.
bin/kafka-topics.sh --create --topic ods_mall_log --bootstrap-server localhost:9092 --partitions 4 --replication-factor 1
- Configure flume
conf/flume1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 25001
# Describe the sink KafkaSink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = ods_mall_log
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start flume to collect the data
bin/flume-ng agent --name a1 --conf conf --conf-file conf/flume1.conf -Dflume.root.logger=INFO,console
- Send data (connect to the listening netcat source)
nc localhost 25001
- Consume the data from Kafka
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ods_mall_log