Compiling and Installing Griffin 0.5.0
1 Compilation and Installation
Link: https://pan.baidu.com/s/1464iXOObzkr_mrkqPcvQ3g
Extraction code: gcwe
1.1 Dependencies
- JDK (1.8 or later)
- MySQL (5.6 or later)
- Hadoop (2.6.0 or later)
- Hive (version 2.x)
- Maven
- Spark (version 2.2.1)
- Livy (livy-0.5.0-incubating)
- ElasticSearch (5.0 or later)
Component roles
- Spark: computes batch and real-time metrics
- Livy: provides a RESTful API for the service to call Apache Spark
- ElasticSearch: stores metric data
- MySQL: stores service metadata
1.2 Installing Maven and Spark
1.2.1 Installing Maven (skip to the next step if already installed)
[root@gcw2 src]# tar -zxvf apache-maven-3.6.3-bin.tar.gz -C ../app/
[root@gcw2 src]# cd ../app/
[root@gcw2 app]# ln -s apache-maven-3.6.3 maven
[root@gcw2 app]# cd maven/
[root@gcw2 maven]# pwd
/opt/app/maven
[root@gcw2 maven]# vi /etc/profile
## add the following
export M2_HOME=/opt/app/maven
export PATH=$PATH:$M2_HOME/bin
##
[root@gcw2 maven]# source /etc/profile
1.2.2 Installing Spark
1 Extract and set the $SPARK_HOME environment variable
[root@gcw2 src]# tar -zxvf spark-2.2.1-bin-hadoop2.7.tgz -C /opt/app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s spark-2.2.1-bin-hadoop2.7 spark
## set environment variables
vi /etc/profile
export SPARK_HOME=/opt/app/spark
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
2 Edit the configuration file $SPARK_HOME/conf/spark-defaults.conf
[root@gcw2 app]# cd $SPARK_HOME/conf/
[root@gcw2 conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@gcw2 conf]# vi spark-defaults.conf
Append the following at the end of the file and adjust the values for your own environment:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://gcw1:9000/spark/logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.jars hdfs://gcw1:9000/spark/spark_2.2.1_jars/*
Copy the MySQL JDBC driver into Spark's jars directory (this setup already has a copy under /opt/app/hive/lib):
[root@gcw2 conf]# cp /opt/app/hive/lib/mysql-connector-java-5.1.46.jar $SPARK_HOME/jars/
Create the HDFS paths referenced above and upload the Spark jars:
[root@gcw2 conf]# hdfs dfs -mkdir -p /spark/logs
[root@gcw2 conf]# hdfs dfs -mkdir -p /spark/spark_2.2.1_jars
[root@gcw2 conf]# hdfs dfs -put /opt/app/spark/jars/*.jar /spark/spark_2.2.1_jars
3 Edit spark-env.sh
[root@gcw2 conf]# cp spark-env.sh.template spark-env.sh
[root@gcw2 conf]# vi spark-env.sh
Append the following at the end of the file and adjust for your own environment:
export JAVA_HOME=/opt/app/jdk
export HADOOP_HOME=/opt/app/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
4 Add the following to yarn-site.xml to disable the virtual-memory check (skip this if resources are sufficient):
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Restart YARN.
Without this setting, starting spark-shell fails with the error: Yarn application has already ended! It might have been killed or unable to launch application master.
5 Start Spark and run a test
[root@gcw2 conf]# spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_261)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Run:
val lines = sc.textFile("/tmp/test/wc.txt")
lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
scala> val lines = sc.textFile("/tmp/test/wc.txt")
scala> lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
res0: Array[(String, Int)] = Array((example,1), (handle,1), (this,1), (initial,1), (metastore,1), (is,2), (hive,2), (guide,1), (snapshot,1), (maximum,1), (can,2), (need,1), (acknowledgement,1), (getting,1), (ticks,2), (between,1), (here,1), (/tmp,2), (will,1), ([QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@998],1), ("0",1), (dataDir=/opt/app/zookeeper/data,1), ("",3), (Address,1), (milliseconds,1), (sending,1), (nohup,2), (autopurge.,1), (WARN,1), ([myid:3],1), (feature,1), (already,1), (autopurge.purgeInterval=1,1), (#ZK提供了自动清理事务日志和快照文件的功能,这个参数指定了清理频率,单位是1时,1), (disable,1), (synchronization,1), (&,2), (syncLimit=5,1), (just,1), (connect,1), (increase,1), (clients,2), (read,1), (Set,1), (The,4), (tickTime=2000,1), (server.3=lew3:2181:3888,1), (interval,1), (number,5), (on,1)
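Because spark.master is set to yarn in spark-defaults.conf, the shell itself runs as a YARN application. An optional sanity check from another terminal ("Spark shell" is the default name Spark registers with YARN):
# the running/finished shell should appear in the application list
yarn application -list -appStates ALL | grep -i "Spark shell"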
1.3 Installing Livy
1 Extract and set the $LIVY_HOME environment variable
[root@gcw2 src]# unzip livy-0.5.0-incubating-bin.zip
[root@gcw2 src]# mv livy-0.5.0-incubating-bin ../app/
[root@gcw2 app]# cd /opt/app
[root@gcw2 app]# ln -s livy-0.5.0-incubating-bin/ livy
# set environment variables
vi /etc/profile
export LIVY_HOME=/opt/app/livy
export PATH=$PATH:$LIVY_HOME/bin
source /etc/profile
2 Edit conf/livy.conf
[root@gcw2 app]# cd $LIVY_HOME/conf
[root@gcw2 conf]# cp livy.conf.template livy.conf
[root@gcw2 conf]# vi livy.conf
livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deployMode = cluster
livy.repl.enable-hive-context = true
3 Edit livy-env.sh
[root@gcw2 conf]# cp livy-env.sh.template livy-env.sh
[root@gcw2 conf]# vi livy-env.sh
export SPARK_HOME=/opt/app/spark
export HADOOP_HOME=/opt/app/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
4 Start the service
cd /opt/app/livy
mkdir logs
nohup /opt/app/livy/bin/livy-server &
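To confirm Livy is up, you can query its REST API (Livy listens on port 8998 by default; since livy.server.host is set to 127.0.0.1 above, run this on the same node):
# an empty session list such as {"from":0,"total":0,"sessions":[]} means the server is running
curl http://127.0.0.1:8998/sessions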
1.4 Installing Elasticsearch 5.6
1 Extract
[root@gcw2 src]# tar -zxvf elasticsearch-5.6.0.tar.gz -C ../app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s elasticsearch-5.6.0/ es
2 Create the elasticsearch group and user
# create the group
groupadd elasticsearch
# create the user
useradd elasticsearch -g elasticsearch
# change the owner of the installation directory
[root@gcw2 app]# cd /opt/app/
[root@gcw2 app]# ll | grep el
drwxr-xr-x 7 root root 131 Sep 7 2017 elasticsearch-5.6.0
lrwxrwxrwx 1 root root 20 Sep 12 00:05 es -> elasticsearch-5.6.0/
[root@gcw2 app]# chown -R elasticsearch:elasticsearch elasticsearch-5.6.0/
3 Edit the Linux system file /etc/security/limits.conf, adding:
elasticsearch hard nofile 1000000
elasticsearch soft nofile 1000000
* soft nproc 4096
* hard nproc 4096
4 Edit the system file /etc/sysctl.conf
# append at the end of the file:
vm.max_map_count=262144
# run the following command for the change to take effect
sysctl -p
5 Edit the Elasticsearch configuration files
config/elasticsearch.yml
network.host: 0.0.0.0
config/jvm.options
-Xms1g
-Xmx1g
6 Start
su elasticsearch
cd /opt/app/es
bin/elasticsearch -d
Visit port 9200 in a browser to verify.
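The same check can be done from the command line, for example (the node name matches this setup; the reported cluster details will differ):
curl http://gcw2:9200/
curl http://gcw2:9200/_cluster/health?pretty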
7 Create the index
curl -XPUT http://gcw2:9200/griffin -d '
{
    "aliases": {},
    "mappings": {
        "accuracy": {
            "properties": {
                "name": {
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    },
                    "type": "text"
                },
                "tmst": {
                    "type": "date"
                }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_replicas": "2",
            "number_of_shards": "5"
        }
    }
}
'
Response:
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}[elasticsearch@gcw2 es]$
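Optionally, confirm the mappings and settings were applied:
curl http://gcw2:9200/griffin?pretty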
The preparation work is now complete.
1.5 Compiling and Installing Griffin
1 Extract and create a symlink
[root@gcw2 src]# cd /opt/src/
[root@gcw2 src]# unzip griffin-griffin-0.5.0.zip
[root@gcw2 src]# mv griffin-griffin-0.5.0 /opt/app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s griffin-griffin-0.5.0/ griffin
2 Create the quartz database in MySQL and initialize it
Note: the initialization script needs a small change, mainly adding use quartz; at the top.
/opt/app/griffin/service/src/main/resources/Init_quartz_mysql_innodb.sql
Run the script as follows:
# in MySQL, create the database
create database quartz;
# edit the script and add use quartz;
vi Init_quartz_mysql_innodb.sql
# run it from the command line to create the tables
mysql -uhive -p12345678 < Init_quartz_mysql_innodb.sql
# the quartz database now contains the following tables
QRTZ_BLOB_TRIGGERS
QRTZ_CALENDARS
QRTZ_CRON_TRIGGERS
QRTZ_FIRED_TRIGGERS
QRTZ_JOB_DETAILS
QRTZ_LOCKS
QRTZ_PAUSED_TRIGGER_GRPS
QRTZ_SCHEDULER_STATE
QRTZ_SIMPLE_TRIGGERS
QRTZ_SIMPROP_TRIGGERS
QRTZ_TRIGGERS
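A quick way to confirm the tables exist (using the same hive/12345678 account as above):
mysql -uhive -p12345678 -e "USE quartz; SHOW TABLES;"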
3 Hadoop and Hive
Create the /spark/spark_conf directory on HDFS and upload Hive's configuration file hive-site.xml to it.
hdfs dfs -mkdir -p /spark/spark_conf
hdfs dfs -put $HIVE_HOME/conf/hive-site.xml /spark/spark_conf/
That is, upload the hive-site.xml from the node where Griffin is installed into the corresponding HDFS directory.
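Verify that the file is where sparkProperties.json (configured later) expects to find it:
# should list hive-site.xml
hdfs dfs -ls /spark/spark_conf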
4 Make sure the following environment variables are set (in /etc/profile):
export JAVA_HOME=/opt/app/jdk
export SPARK_HOME=/opt/app/spark
export LIVY_HOME=/opt/app/livy
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
[root@gcw2 resources]# echo $JAVA_HOME $HADOOP_CONF_DIR $SPARK_HOME $LIVY_HOME
/opt/app/jdk /opt/app/hadoop/etc/hadoop /opt/app/spark /opt/app/livy
1.6 Compiling Griffin
1 Configure service/pom.xml
Edit service/pom.xml (around lines 113-117) and enable the MySQL JDBC dependency (i.e., remove the comment markers around it):
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql.java.version}</version>
</dependency>
2 Edit the configuration file service/src/main/resources/application.properties
server.port = 9876
spring.application.name=griffin_service
spring.datasource.url=jdbc:mysql://gcw3:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=root
spring.datasource.password=123456
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore
hive.metastore.uris=thrift://gcw1:9083
hive.metastore.dbname=hivemetadata
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry
kafka.schema.registry.url=http://localhost:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# elasticsearch
elasticsearch.host=gcw2
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://localhost:8998/batches
livy.need.queue=false
livy.task.max.concurrent.count=20
livy.task.submit.interval.second=3
livy.task.appId.retry.count=3
# yarn url
yarn.uri=http://gcw3:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
The default port is 8080; it is changed to 9876 here to avoid a conflict with Spark. The Hive metastore service must be running. If Griffin and MySQL are not installed on the same node, make sure the database user is allowed to log in remotely.
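Two quick checks before building, using the hostnames and credentials from the properties above (adjust to your environment):
# on gcw1: the Hive metastore must be running (see also section 1.8)
nohup hive --service metastore &
# from the Griffin node: the MySQL account must accept remote logins
mysql -h gcw3 -uroot -p123456 -e "SELECT 1;"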
3 Edit the configuration file service/src/main/resources/quartz.properties
# change line 26 to the following:
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
4 Edit the configuration file service/src/main/resources/sparkProperties.json
{
    "file": "hdfs:///griffin/griffin-measure.jar",
    "className": "org.apache.griffin.measure.Application",
    "name": "griffin",
    "queue": "default",
    "numExecutors": 2,
    "executorCores": 1,
    "driverMemory": "1g",
    "executorMemory": "1g",
    "conf": {
        "spark.yarn.dist.files": "hdfs:///spark/spark_conf/hive-site.xml"
    },
    "files": []
}
5 Edit the configuration file service/src/main/resources/env/env_batch.json
Change line 24 to:
"api": "http://gcw2:9200/griffin/accuracy",
6 Build
cd /opt/app/griffin
mvn -Dmaven.test.skip=true clean install
The build downloads 500 MB+ of jars, so point Maven at the Aliyun mirror. If any of the configuration files above are changed later, Griffin must be rebuilt.
7 Build error
Cannot find name 'SVGElementTagNameMap'. [ERROR] ERROR
The offending file is generated during the build (it does not exist beforehand):
/opt/app/griffin/ui/angular/node_modules/@types/jquery/JQuery.d.ts
Delete the following declarations (around line 4137):
find<K extends keyof SVGElementTagNameMap>(selector_element: K| JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
parents<K extends keyof SVGElementTagNameMap>(selector: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;
8 Rebuild
cd /opt/app/griffin
mvn -Dmaven.test.skip=true clean install
Success:
[INFO]
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ measure ---
[INFO] Installing /opt/app/griffin-griffin-0.5.0/measure/target/measure-0.5.0.jar to /root/.m2/repository/org/apache/griffin/measure/0.5.0/measure-0.5.0.jar
[INFO] Installing /opt/app/griffin-griffin-0.5.0/measure/pom.xml to /root/.m2/repository/org/apache/griffin/measure/0.5.0/measure-0.5.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO]
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [ 0.919 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SUCCESS [ 35.044 s]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [07:03 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [02:58 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10:38 min
[INFO] Finished at: 2021-09-13T21:59:47+08:00
[INFO] ------------------------------------------------------------------------
9 Copy the jars
# copy service-0.5.0.jar to /opt/app/griffin
cd /opt/app/griffin/service/target
cp service-0.5.0.jar /opt/app/griffin
# copy measure-0.5.0.jar to /opt/app/griffin and rename it
cd /opt/app/griffin/measure/target
cp measure-0.5.0.jar /opt/app/griffin/griffin-measure.jar
# upload griffin-measure.jar to hdfs:///griffin
cd /opt/app/griffin
hdfs dfs -mkdir /griffin
hdfs dfs -put griffin-measure.jar /griffin
Note: when Spark executes jobs on the YARN cluster, it loads griffin-measure.jar from the /griffin directory on HDFS; this avoids "class org.apache.griffin.measure.Application not found" errors.
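Verify the upload:
# should show griffin-measure.jar
hdfs dfs -ls /griffin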
1.7 Starting the Griffin Service
cd /opt/app/griffin
nohup java -jar service-0.5.0.jar>service.out 2>&1 &
Username / password: admin / admin
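To confirm the service came up before logging in (port 9876 as configured in application.properties; gcw2 is the install node in this setup):
tail -n 50 service.out
# expect HTTP 200 once the UI is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://gcw2:9876/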
1.8 Starting and Stopping Related Services
Start
## start Livy on gcw2
nohup /opt/app/livy/bin/livy-server &
## start Elasticsearch on gcw2
su elasticsearch
cd /opt/app/es
bin/elasticsearch -d
exit
## start the Hive metastore on gcw1
[root@gcw1 ~]# nohup hive --service metastore &
Stop
## find the corresponding process with jps and stop it with kill -9; Elasticsearch as an example
[elasticsearch@gcw2 es]$ jps
7008 Jps
6705 Elasticsearch
[elasticsearch@gcw2 es]$ kill -9 6705
2 Data Quality
2.1 Data Assets
2.2 Data Measures: creating a measure
- Choose Accuracy if you want to measure the match rate between a source and a target
- Choose Data Profiling if you want to inspect specific characteristics of the data, for example the number of null values in a column (see the illustrative query after this list)
- count how many values in a given column are null, unique, or duplicated
- compute the maximum, minimum, mean, median, and so on
- analyze value frequency and patterns with regular expressions
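As a rough illustration only (plain HiveQL over a hypothetical user_info table, not Griffin's own rule format), the statistics listed above correspond to a query such as:
hive -e "
SELECT COUNT(*)                                       AS total_cnt,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS email_null_cnt,
       COUNT(DISTINCT email)                          AS email_distinct_cnt,
       MAX(age) AS age_max,
       MIN(age) AS age_min,
       AVG(age) AS age_avg
FROM user_info;"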