Installing and Using the Competition Components

Kafka

Official site: https://kafka.apache.org/

1. Installation and shell operations

STEP 1: GET KAFKA
tar -zxvf kafka_2.12-2.4.1.tgz -C ../export/servers/
cd kafka_2.12-2.4.1/
STEP 2: START THE KAFKA ENVIRONMENT

# Start the ZooKeeper service
$ bin/zookeeper-server-start.sh config/zookeeper.properties
# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties

STEP 3: CREATE A TOPIC TO STORE YOUR EVENTS
Here, "events" simply means messages.

$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092

STEP 4: WRITE SOME EVENTS INTO THE TOPIC

$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
# Error: "bootstrap-server is not a recognized option"
# Versions before 2.5 use "--broker-list" for the console producer; 2.5 and later use "--bootstrap-server"
# The version here is 2.4.1, so use:
$ bin/kafka-console-producer.sh --topic quickstart-events --broker-list localhost:9092

STEP 5: READ THE EVENTS

$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092

STEP 6: TERMINATE THE KAFKA ENVIRONMENT

Stop the producer and consumer clients with Ctrl-C, if you haven't done so already.
Stop the Kafka broker with Ctrl-C.
Lastly, if the Kafka with ZooKeeper section was followed, stop the ZooKeeper server with Ctrl-C.

rm -rf /tmp/kafka-logs /tmp/zookeeper

2. Reading and writing Kafka data with Spark

Since Spark 1.3, KafkaUtils has offered two ways to create a DStream: KafkaUtils.createStream (receiver-based) and KafkaUtils.createDirectStream. The direct approach used here relies on the Kafka 0.10 Spark Streaming integration, i.e. spark-streaming-kafka-0-10_2.12.

  • Server
    OS: CentOS 7
    IP: 10.10.0.100
    Kafka: 2.4.1
    (1) Start the ZooKeeper service
    (2) Start the Kafka service

  • Development machine
    OS: Ubuntu 18
    IDE: IDEA 2022
    Framework: Spark 3.1.1
    Language: Scala 2.12

pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo3</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>Archetype - demo3</name>
  <url>http://maven.apache.org</url>

  <properties>
    <spark-version>3.1.1</spark-version>
  </properties>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>${spark-version}</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.12</artifactId>
      <version>${spark-version}</version>
<!--      <scope>provided</scope>-->
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
      <version>${spark-version}</version>
    </dependency>

  </dependencies>
</project>

2.1 Reading data from Kafka

2.1.1 Producing data

Produce data with Kafka:
(1) Create the topic WordCount
bin/kafka-topics.sh --create --topic WordCount --bootstrap-server localhost:9092
(2) Start a console producer and type some data into the WordCount topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic WordCount

2.1.2 Consuming data

Consume the data with a Spark program:

Spark02_ReadFromKafka.scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}


object Spark02_ReadFromKafka {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)  //display error only
    val conf = new SparkConf().setAppName("StreamWordCount").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batch interval

    // Kafka consumer parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "10.10.0.100:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group1"
    )

    // Subscribe to the WordCount topic with a direct stream
    val topics = Array("WordCount")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    val kafkaStream = stream.map(record => (record.key, record.value))
    val lines = kafkaStream.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map { x => (x, 1) }
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

2.2 Writing data to Kafka

2.2.1 Producing data

Produce the data with a Spark program.
In the project root, create a data directory and the file data/sample.log, and put a few lines of text in it.

Spark03_WriteToKafka.scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import java.util.Properties

object Spark03_WriteToKafka {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf().setAppName("write-to-kafka").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Read the local file as an RDD of lines
    val lines: RDD[String] = sc.textFile("data/sample.log")

    // Kafka producer configuration
    val prop = new Properties()
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.10.0.100:9092")
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])

    // Send each line to the topic mytopic1.
    // KafkaProducer is not serializable, so create one producer per partition
    // inside foreachPartition rather than on the driver.
    lines.foreachPartition { iter =>
      val producer = new KafkaProducer[String, String](prop)
      iter.foreach { line =>
        val record = new ProducerRecord[String, String]("mytopic1", line)
        producer.send(record)
      }
      producer.close()
    }

    sc.stop()
  }
}

2.2.2 Consuming data

Consume the data with the Kafka console consumer:
bin/kafka-console-consumer.sh --topic mytopic1 --from-beginning --bootstrap-server localhost:9092

MySQL

Official site: https://www.mysql.com/
Goal: install MySQL 5.7 on CentOS 7. How?

  • Option 1: yum
    A plain yum install mysql-community-server installs the latest 8.x release.

  • Option 2: rpm
    Download the MySQL 5.7 rpm packages and install them with rpm -ivh xxx.rpm, but you have to resolve the dependencies between packages yourself.

  • Option 3: tar
    Download the MySQL 5.7 tarball and unpack it with tar, but you have to configure MySQL yourself.

  • Final choice: a custom yum repository

1. Download the yum repository package

wget http://dev.mysql.com/get/mysql57-community-release-el7-8.noarch.rpm
PS: if wget is not installed, you can also download this rpm with a browser.

2. Install the MySQL repository

yum localinstall mysql57-community-release-el7-8.noarch.rpm
yum repolist enabled # verify

3. Edit the mysql-community.repo file to change which MySQL version is installed by default

Set enabled=1 for the version you want to install (and 0 for the others).

PS: in this repo 5.7 is enabled by default, so this step can be skipped.

4. Install MySQL

yum -y install mysql-community-server --nogpgcheck

5. Start the MySQL service

systemctl start mysqld
systemctl status mysqld
systemctl enable mysqld # start on boot
systemctl daemon-reload

6. Change the local root password

(1) Get the temporary password
grep 'temporary password' /var/log/mysqld.log
(2) Log in with the temporary password
mysql -uroot -p
***
(3) Relax the MySQL password policy:
Lower the password strength requirement:
set global validate_password_policy=0;
Set the minimum password length:
set global validate_password_length=4;
(4) Change the root password
alter user 'root'@'localhost' identified by 'root';

7. Set the default character set to utf8

vi /etc/my.cnf

[mysqld]
character_set_server=utf8
init_connect='SET NAMES utf8'

Default file locations:
Config file: /etc/my.cnf
Log file: /var/log/mysqld.log
systemd unit: /usr/lib/systemd/system/mysqld.service
PID file: /var/run/mysqld/mysqld.pid
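
To confirm the installation is reachable from code, a minimal JDBC check can be used. This is only a sketch: it assumes the mysql-connector-java driver (e.g. the 5.1.x jar used later for Hive) is on the classpath, and that either it is run on the server against localhost or remote access for root has been granted; the file name MySQL_Ping.scala is illustrative.

MySQL_Ping.scala
import java.sql.DriverManager

object MySQL_Ping {
  def main(args: Array[String]): Unit = {
    // Assumes mysql-connector-java on the classpath; use localhost when running on the server,
    // or grant remote access to root first when connecting from another machine (assumption).
    val conn = DriverManager.getConnection(
      "jdbc:mysql://10.10.0.100:3306/mysql?useSSL=false", "root", "root")
    val rs = conn.createStatement().executeQuery("SELECT VERSION()")
    while (rs.next()) println("MySQL version: " + rs.getString(1))
    conn.close()
  }
}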

Maxwell

Official site: https://maxwells-daemon.io/
Maxwell is a daemon that reads the MySQL binlog to capture change data and outputs the changes as JSON to streaming platforms such as Kafka.

Installation

OS: CentOS 7
maxwell-1.29.2.tar.gz
tar -zxvf ~/package/maxwell-1.29.2.tar.gz -C ~/export/servers/
vi /etc/profile

export MAXWELL_HOME=/root/export/servers/maxwell-1.29.2
export PATH=$PATH:$MAXWELL_HOME/bin

source /etc/profile

Configure MySQL

vi /etc/my.cnf

# /etc/my.cnf

[mysqld]
# maxwell needs binlog_format=row
# Maxwell needs the binlog to be recorded at row level
binlog_format=row
# unique id of this MySQL server
server_id=1
# enable the binlog, using "master" as the prefix of the binlog file names
log-bin=master

Configure Maxwell

PS: this is optional, but then the parameters must be given on the command line at startup.
cd $MAXWELL_HOME
cp config.properties.example config.properties
vi config.properties

############ add the following ############

# producer type
producer=kafka

# target Kafka brokers
kafka.bootstrap.servers=localhost:9092

# Kafka topic
kafka_topic=maxwell

# MySQL connection info
host=localhost
user=root
password=root
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true

# partition Kafka messages by primary key so rows are spread across partitions and skew is avoided
producer_partition_by=primary_key

Startup

1. Start ZooKeeper

Open a new terminal:
cd $KAFKA_HOME
bin/zookeeper-server-start.sh config/zookeeper.properties

2. Start Kafka

Open a new terminal:
cd $KAFKA_HOME
bin/kafka-server-start.sh config/server.properties

3. Start Maxwell

Open a new terminal.

  • Option 1: parameters taken from the config file
    bin/maxwell --config config.properties --daemon

  • Option 2: parameters given on the command line
    bin/maxwell --user=root --password=root --host=localhost --producer=kafka --kafka.bootstrap.servers=localhost:9092 --kafka_topic=maxwell

# Maxwell captures MySQL change data and, acting as a producer, writes it to the Kafka topic "maxwell"

4. Start a Kafka console consumer

Open a new terminal:
cd $KAFKA_HOME
bin/kafka-console-consumer.sh --topic maxwell --from-beginning --bootstrap-server localhost:9092

5. Change some data in MySQL

Open a new terminal.
Insert, update, or delete rows in any database.

6. Watch the consumer

{"database":"test_db","table":"person","type":"insert","ts":1702388752,"xid":1339,"commit":true,"data":{"id":1,"name":"zerg"}}
{"database":"test_db","table":"person","type":"insert","ts":1702389034,"xid":2043,"commit":true,"data":{"id":2,"name":"Terren"}}
{"database":"test_db","table":"person","type":"insert","ts":1702389692,"xid":3365,"commit":true,"data":{"id":3,"name":"Protoss"}}
{"database":"test_db","table":"person","type":"update","ts":1702389795,"xid":3628,"commit":true,"data":{"id":1,"name":"Zerg"},"old":{"name":"zerg"}}
{"database":"test_db","table":"person","type":"delete","ts":1702390228,"xid":4713,"commit":true,"data":{"id":1,"name":"Zerg"}}

Redis

Official site: https://redis.io/
OS: CentOS 7
Package: ~/package/redis-6.2.6.tar.gz # source package, not yet compiled

Installation

1. Unpack

cd ~
tar -zxvf package/redis-6.2.6.tar.gz -C export/servers/

2. Build

cd /root/export/servers/redis-6.2.6 # enter the redis source directory
make # compile

Error: cc: command not found

Cause: no C compiler is installed on CentOS.
Fix:
yum -y install gcc automake autoconf libtool make
make # compile again

Error: zmalloc.h:50:31: fatal error: jemalloc/jemalloc.h: No such file or directory

Fix:
Delete the extracted redis directory, then re-extract and rebuild until it succeeds (running make distclean before make also clears the leftover build state from the failed compile).

3. Install

make install # install the binaries into /usr/local/bin

4. Test

  • Start the redis server
    redis-server
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 6.2.6 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                  
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 6348
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           https://redis.io       
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-' 
  • Stop the redis server
    redis-cli shutdown

Programming

  • Server: CentOS (with the Redis service installed)
  • Development machine: Ubuntu (IDEA)

(1) Change the Redis configuration

cd ~/export/servers/redis-6.2.6 # enter the redis directory
vi redis.conf

# disable protected mode
protected-mode no # the default is yes
# stop binding only the local address so external clients can connect
#bind 127.0.0.1 -::1 # comment out this line

(2) Write a Java program

a. Add the Jedis dependency for connecting to Redis to pom.xml

<!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.9.0</version>
</dependency>

b. Create the Java program

Redis_Test.java
import redis.clients.jedis.Jedis;

public class Redis_Test {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("10.10.0.100",6379);
        jedis.set("myname","mike");
        System.out.println(jedis.get("myname"));
    }
}

Run it:

mike

c. Check with the redis command-line client

redis-cli

127.0.0.1:6379> get myname
"mike"

ClickHouse

Official site: https://clickhouse.com/
ClickHouse is the fastest and most resource-efficient open-source database for real-time applications and analytics.
ClickHouse® is a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP). It is available both as open-source software and as a cloud service.

1. Installation

OS: CentOS 7
Version: ClickHouse 21.9.4
Download the packages:
clickhouse-server-21.9.4.35-2.noarch.rpm 113K
clickhouse-common-static-21.9.4.35-2.x86_64.rpm 183M
clickhouse-client-21.9.4.35-2.noarch.rpm 90k
(1) Install the static package
rpm -ivh clickhouse-common-static-21.9.4.35-2.x86_64.rpm
(2) Install the server
yum localinstall clickhouse-server-21.9.4.35-2.noarch.rpm

Password for default user is empty string. See /etc/clickhouse-server/users.xml and /etc/clickhouse-server/users.d to change it.

(3) Install the client
yum localinstall clickhouse-client-21.9.4.35-2.noarch.rpm
(4) Start the service
systemctl enable clickhouse-server # start on boot
systemctl start clickhouse-server # start the service
systemctl status clickhouse-server # check the status
(5) Connect with the client
clickhouse-client

ClickHouse client version 21.9.4.35 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.9.4 revision 54449.

hadoop01 :)

2. Usage

Command-line client

clickhouse-client -m # allow multi-line queries
Official demo:
(1) Create a table

CREATE TABLE my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)

(2) Insert data

INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 )

(3) Query
SELECT * FROM my_first_table ORDER BY timestamp

Programming client

(1) Prepare the data
Database: starcraft
Table: race
Data:

┌─id─┬─name────┐
│  1 │ Zerg    │
│  2 │ Terren  │
│  3 │ Protoss │
└────┴─────────┘

(2) Configure dependencies

pom.xml
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.1</version>
<!--            <scope>provided</scope>-->
<!-- provided means the dependency is available at compile time but excluded from the packaged jar, which keeps the jar small because the cluster already ships these libraries. When running locally the classes are still needed, and missing them causes errors, so remove provided (or comment it out) for local runs. -->
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.clickhouse/clickhouse-jdbc -->
        <dependency>
            <groupId>com.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>0.4.0</version>
        </dependency>

(3) Reading ClickHouse data from Spark

ClickHouse_Read.scala
package cn.beast.study

import org.apache.spark.sql.SparkSession

object ClickHouse_Read {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("ReadTest").master("local[*]").getOrCreate()

    spark.read.format("jdbc")
      .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") // clickhouse 连接驱动
      .option("url", "jdbc:clickhouse://10.10.0.100:8123/starcraft") // 指定 clickhouse 连接库
      .option("user", "") // 用户名
      .option("password", "") // 密码
      .option("dbtable", "race") // 连接表
      .load()
      .orderBy("id")
      .show()

    spark.stop()

  }
}

# If the connection is refused, edit /etc/clickhouse-server/config.xml:
# uncomment "<listen_host>::</listen_host>" so the server accepts connections on both IPv4 and IPv6 addresses, then restart clickhouse-server.

(4) Writing data to ClickHouse from Spark

ClickHouse_Write.scala
package cn.beast.study

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ClickHouse_Write {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("WriteTest").master("local[*]").getOrCreate()

    // data/race.txt: one record per line, "id name" separated by a space
    spark.read
      .text("data/race.txt")
      .createOrReplaceTempView("data")

    val dataFrame: DataFrame = spark.sql(
      """
        |select
        |     split(value," ")[0] id,
        |     split(value," ")[1] name
        |from data
        |""".stripMargin)

    dataFrame.write.format("jdbc")
      .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") // clickhouse 连接驱动
      .option("url", "jdbc:clickhouse://10.10.0.100:8123/starcraft") // 指定 clickhouse 连接库
      .option("user", "") // 用户名
      .option("password", "") // 密码
      .option("dbtable", "race") // 连接表,该表必须存在!否则会报错
      .mode(SaveMode.Append) // 以追加的模式写入数据
      .save()


    spark.stop()

  }
}

Scala

Official site: https://www.scala-lang.org/

1. Installation

OS: Ubuntu 18
Scala version: 2.12.18

  • Download
    wget https://www.scala-lang.org/files/archive/scala-2.12.18.deb
  • Install
    sudo dpkg -i scala-2.12.18.deb
    dpkg -L scala # show where Scala was installed

2. Usage
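
As a quick sanity check of the installation, a minimal program can be compiled and run from the command line. This is only an example; the file name HelloScala.scala is made up.

HelloScala.scala
// Compile with `scalac HelloScala.scala`, run with `scala HelloScala`
object HelloScala {
  def main(args: Array[String]): Unit = {
    println("hello, scala " + scala.util.Properties.versionNumberString)
  }
}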

Hive

1. Installation

tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /software/

2. Configuration

(1) Environment variables
vi /etc/bashrc

export HIVE_HOME=/software/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

source /etc/bashrc
(2) Configure hive-site.xml
A remote metastore backed by MySQL needs this configuration; the embedded Derby metastore does not.
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
vi hive-site.xml

:set nu # show line numbers
:3,$d # delete from line 3 through the last line
<property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://10.10.0.100:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>root</value>
</property>
<property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>root</value>
</property>

3. Copy the MySQL JDBC driver

docker cp mysql-connector-java-5.1.39-bin.jar hadoop1:/software/apache-hive-3.1.2-bin/lib

4. Initialize the metastore

./bin/schematool -dbType mysql -initSchema

Error 1:
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.base.Preconditions.checkArgument
Cause:
The guava.jar bundled with Hive and the one bundled with Hadoop have different versions.
Fix:
1. Check the version of guava.jar under share/hadoop/common/lib in the Hadoop installation.
2. Check the version of guava.jar under lib in the Hive installation; if the two differ, copy the higher version over the lower one.

Then run the initialization again.

5. Start Hive

(1) Start Hadoop
start-all.sh
(2) Start Hive
hive

Hive Session ID = 9e311256-b2af-4c85-a9b7-23fc38d1baef
hive> show databases;
OK
default
Time taken: 1.47 seconds, Fetched: 1 row(s)
hive> create database test_db1;
OK
Time taken: 0.459 seconds
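
Hive tables can also be queried from the Spark project used earlier. The following is only a sketch under assumptions not covered above: spark-hive_2.12 (and the MySQL driver) would have to be added to pom.xml, and hive-site.xml plus the Hadoop client configuration copied into src/main/resources so the metastore is reachable; the file name Hive_Read.scala is illustrative.

Hive_Read.scala
import org.apache.spark.sql.SparkSession

object Hive_Read {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport requires the spark-hive dependency and hive-site.xml on the classpath (assumption)
    val spark = SparkSession.builder()
      .appName("HiveRead")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show() // should list default and test_db1
    spark.stop()
  }
}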

Zookeeper

1. Installation

tar -zxvf apache-zookeeper-3.5.5-bin.tar.gz -C /software/

2. Configuration

(1) Environment variables

export ZOOKEEPER_HOME=/software/apache-zookeeper-3.5.5-bin
export PATH=$PATH:$ZOOKEEPER_HOME/bin

(2) Configure zoo.cfg

# create the data and log directories first
cd /software/apache-zookeeper-3.5.5-bin/
mkdir data log
# copy zoo_sample.cfg:
cd conf
cp zoo_sample.cfg zoo.cfg

vim zoo.cfg

dataDir=/software/apache-zookeeper-3.5.5-bin/data
dataLogDir=/software/apache-zookeeper-3.5.5-bin/log
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888

(3) Create the myid file

Create myid under the directory set as dataDir and write this node's id into it:
cd /software/apache-zookeeper-3.5.5-bin/data
echo '1' >myid

(4) Copy apache-zookeeper-3.5.5-bin to hadoop2 and hadoop3

scp -r /software/apache-zookeeper-3.5.5-bin hadoop2:/software/
# change myid under the data directory on hadoop2 to 2
vi /software/apache-zookeeper-3.5.5-bin/data/myid
2
scp -r /software/apache-zookeeper-3.5.5-bin hadoop3:/software/
# change myid under the data directory on hadoop3 to 3
vi /software/apache-zookeeper-3.5.5-bin/data/myid
3

3. Startup

# start zkServer on each of the three machines
zkServer.sh start
# check the process
jps
3048 QuorumPeerMain
# check the status
zkServer.sh status
# shows follower or leader

HBase

1. Installation

tar -zxvf hbase-2.2.3-bin.tar.gz -C /software/

2. Configuration

(1) Environment variables

export HBASE_HOME=/software/hbase-2.2.3
export PATH=$PATH:$HBASE_HOME/bin

(2) Configure hbase-env.sh

cd $HBASE_HOME/conf # enter the configuration directory
vi hbase-env.sh
	export JAVA_HOME=/software/jdk1.8.0_181
	export HBASE_MANAGES_ZK=false

(3) Configure hbase-site.xml

vi hbase-site.xml

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop1:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop1,hadoop2,hadoop3</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/software/apache-zookeeper-3.5.5-bin/data</value>
    </property>
 </configuration>

(4) Configure regionservers

vi regionservers

hadoop2
hadoop3
# Note: the regionservers file lists the hosts on which you want an HRegionServer to run

(5) Copy hbase-2.2.3 to hadoop2 and hadoop3

scp -r /software/hbase-2.2.3 hadoop2:/software/
scp -r /software/hbase-2.2.3 hadoop3:/software/

3. Startup

# start the hadoop cluster
start-all.sh
# start zookeeper (note: on all three nodes)
zkServer.sh start
# start hbase
start-hbase.sh
# check the processes
jps
# processes on hadoop1
2225 NameNode
2564 ResourceManager
4245 Jps
4070 HMaster (added by starting hbase)
3048 QuorumPeerMain (added by starting zookeeper)
2413 SecondaryNameNode
# processes on hadoop2
2057 DataNode
2749 HRegionServer (added by starting hbase)
2367 QuorumPeerMain (added by starting zookeeper)
2895 Jps
# web UI
http://10.10.0.100:16010
# start the hbase shell
hbase shell
# exit the hbase shell
exit
# stop hbase
stop-hbase.sh
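
HBase can also be accessed from the same Scala project. The sketch below is only illustrative: it assumes hbase-client 2.2.3 is added to pom.xml, that hadoop1/hadoop2/hadoop3 resolve from the development machine, and that a table has first been created in the hbase shell (create 'test_tbl', 'info'); the file and table names are made up.

HBase_Test.scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBase_Test {
  def main(args: Array[String]): Unit = {
    // Point the client at the ZooKeeper quorum configured in hbase-site.xml
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop1,hadoop2,hadoop3")
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("test_tbl"))

    // Write one cell, then read it back
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Zerg"))
    table.put(put)

    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close()
    conn.close()
  }
}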

Spark

  • Standalone mode

Installation

tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz -C /software/

Configuration

  1. Environment variables
# standalone
export SPARK_HOME=/software/spark-3.1.1-bin-hadoop3.2
  2. conf/workers
hadoop2
hadoop3

Copy to the other nodes

scp -r /software/spark-3.1.1-bin-hadoop3.2 hadoop2:/software/
scp -r /software/spark-3.1.1-bin-hadoop3.2 hadoop3:/software/

Startup

cd $SPARK_HOME/
sbin/start-all.sh

Processes on hadoop1:
216 Jps
61 Master

Processes on hadoop2:
326 Jps
55 Worker

spark-shell

bin/spark-shell --master spark://hadoop1:7077

Spark context Web UI available at http://a16e70f1373e:4040
Spark context available as 'sc' (master = spark://hadoop1:7077, app id = app-20240111081519-0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> print("hello,spark!")
hello,spark!

spark-submit

The official example that computes an approximation of Pi:
bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://hadoop1:7077 examples/jars/spark-examples_2.12-3.1.1.jar

24/01/11 08:10:19 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 23.461299 s
Pi is roughly 3.1472157360786803

webUI
http://10.10.0.100:8080

Application ID: app-20240111080947-0000
Name: Spark Pi
Cores: 2
Memory per Executor: 1024.0 MiB
Submitted Time: 2024/01/11 08:09:47
User: root
State: FINISHED
Duration: 33 s

  • YARN (mixed) mode

No need to start the Spark standalone daemons.
Start Hadoop, then submit:
$>spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.12-3.1.1.jar

Flink

  • Local mode

  1. Extract
tar -xzvf flink-1.14.0-bin-scala_2.12.tgz
cd flink-1.14.0
  2. Start
./bin/start-cluster.sh
  3. Submit a job
# example bundled with flink
./bin/flink run examples/streaming/WordCount.jar
tail log/flink-root-taskexecutor-0-hadoop01.out
#  (nymph,1)
#  (in,3)
...
  4. Stop
./bin/stop-cluster.sh
  • Standalone mode

  1. Extract
tar -zxvf flink-1.14.0-bin-scala_2.12.tgz -C /software/
  2. Configure
  • conf/flink-conf.yaml
jobmanager.rpc.address: hadoop1
  • conf/workers
hadoop2
hadoop3
  3. Copy
scp -r flink-1.14.0/ hadoop2:/software/
scp -r flink-1.14.0/ hadoop3:/software/
  4. Start
bin/start-cluster.sh
jps
# hadoop1
633 Jps
61 Master
590 StandaloneSessionClusterEntrypoint
# hadoop2
678 Jps
55 Worker
603 TaskManagerRunner
  5. Test
# example bundled with flink
./bin/flink run examples/streaming/WordCount.jar
# the output is in the log on a taskmanager node (hadoop2 or hadoop3)
vi log/flink-root-taskexecutor-0-a16e70f1373e.out

  • YARN mode

  • Starting a Flink Session on YARN
  1. Set the Hadoop classpath
export HADOOP_CLASSPATH=`hadoop classpath`
  2. Start a YARN session
    ./bin/yarn-session.sh --detached
  3. Submit the example job
    ./bin/flink run ./examples/streaming/TopSpeedWindowing.jar
  4. Stop the YARN session
    echo "stop" | ./bin/yarn-session.sh -id application_1705020676638_0001

Flume

  • demo
  1. Write the configuration file
    conf/example.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  2. Start the Flume agent
    bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console

  3. Send messages (connect to the netcat source and type some lines)
    nc localhost 44444

  • demo2

This task comes from the e-commerce scenario of the Big Data track of the National Vocational Skills Competition - real-time data collection.

Task
1. On the Master node, use Flume to collect the socket data emitted by the real-time data generator on port 25001 (the generator script is /data_log/gen_ds_data_to_socket on the Master node; it is deployed locally and sends data over a socket). Write the data into a Kafka topic named ods_mall_log with 4 partitions, then use Kafka's own console consumer to consume ods_mall_log and inspect the first 2 records.
Note: the pre-configured Flume agent must be started before the script, otherwise the script will not start. To run the script, enter /data_log and execute ./gen_ds_data_to_socket (if permission is denied, run chmod 777 /data_log/gen_ds_data_to_socket).

  1. Create the topic
    Start the Kafka service first, then:
    bin/kafka-topics.sh --create --topic ods_mall_log --bootstrap-server localhost:9092 --partitions 4 --replication-factor 1
  2. Configure Flume
    conf/flume1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 25001

# Describe the sink KafkaSink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = ods_mall_log

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  3. Start the Flume agent to collect the data
    bin/flume-ng agent --name a1 --conf conf --conf-file conf/flume1.conf -Dflume.root.logger=INFO,console

  4. Send data (in the competition this comes from the generator script; for a quick test, connect with nc and type some lines)
    nc localhost 25001

  5. Consume the data from Kafka
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ods_mall_log
