
Griffin-0.5.0 Compilation and Installation

1 Compilation and Installation

Software download link

Link: https://pan.baidu.com/s/1464iXOObzkr_mrkqPcvQ3g
Extraction code: gcwe

1.1 Dependencies

  • JDK (1.8 or later)
  • MySQL (version 5.6 or later)
  • Hadoop (2.6.0 or later)
  • Hive (version 2.x)
  • Maven
  • Spark (version 2.2.1)
  • Livy (livy-0.5.0-incubating)
  • ElasticSearch (5.0 or later)

Component roles

  • Spark: computes batch and streaming metrics
  • Livy: provides a RESTful API for the service to call Apache Spark
  • ElasticSearch: stores metric data
  • MySQL: stores service metadata

1.2 Installing Maven and Spark

1.2.1 Installing Maven (skip to the next step if already installed)

[root@gcw2 src]# tar  -zxvf apache-maven-3.6.3-bin.tar.gz -C ../app/
[root@gcw2 src]# cd ../app/
[root@gcw2 app]# ln -s apache-maven-3.6.3 maven
[root@gcw2 app]# cd maven/
[root@gcw2 maven]# pwd
/opt/app/maven
[root@gcw2 maven]# vi /etc/profile
## Append the following
export M2_HOME=/opt/app/maven
export PATH=$PATH:$M2_HOME/bin
##
[root@gcw2 maven]# source /etc/profile
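
To confirm the installation took effect, a quick check (the reported version should be 3.6.3):

# Should print Apache Maven 3.6.3
mvn -version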

1.2.2 Installing Spark

1 Extract the archive and set the $SPARK_HOME environment variable

[root@gcw2 src]# tar -zxvf spark-2.2.1-bin-hadoop2.7.tgz -C /opt/app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s spark-2.2.1-bin-hadoop2.7 spark
## Set environment variables
vi /etc/profile
export SPARK_HOME=/opt/app/spark
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile

2 Edit the configuration file $SPARK_HOME/conf/spark-defaults.conf

[root@gcw2 app]# cd  $SPARK_HOME/conf/
[root@gcw2 conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@gcw2 conf]# vi spark-defaults.conf

Append the following at the end of the file, adjusting the values to your own environment:

spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://gcw1:9000/spark/logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.jars hdfs://gcw1:9000/spark/spark_2.2.1_jars/*

Copy the MySQL JDBC driver into place (in this setup there is a copy under /opt/app/hive/lib):

[root@gcw2 conf]# cp /opt/app/hive/lib/mysql-connector-java-5.1.46.jar  $SPARK_HOME/jars/

Create the HDFS paths and upload the Spark jars:

[root@gcw2 conf]# hdfs dfs -mkdir -p /spark/logs
[root@gcw2 conf]# hdfs dfs -mkdir -p /spark/spark_2.2.1_jars
[root@gcw2 conf]# hdfs dfs -put /opt/app/spark/jars/*.jar /spark/spark_2.2.1_jars

3 Edit spark-env.sh

[root@gcw2 conf]# cp spark-env.sh.template spark-env.sh
[root@gcw2 conf]# vi spark-env.sh

Append the following at the end of the file, adjusting the values to your own environment:

export JAVA_HOME=/opt/app/jdk
export HADOOP_HOME=/opt/app/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

4 Add the following to yarn-site.xml to disable the memory check (skip if resources are sufficient)

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>

Restart YARN.
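
One way to restart it, assuming the standard Hadoop sbin scripts:

# Run on the ResourceManager node
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh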

Without this setting, starting spark-shell fails with: Yarn application has already ended! It might have been killed or unable to launch application master.

5 Start and test Spark

[root@gcw2 conf]# spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_261)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

Run a simple word count:

scala> val lines = sc.textFile("/tmp/test/wc.txt")
scala> lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()

res0: Array[(String, Int)] = Array((example,1), (handle,1), (this,1), (initial,1), (metastore,1), (is,2), (hive,2), (guide,1), (snapshot,1), (maximum,1), (can,2), (need,1), (acknowledgement,1), (getting,1), (ticks,2), (between,1), (here,1), (/tmp,2), (will,1), ([QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@998],1), ("0",1), (dataDir=/opt/app/zookeeper/data,1), ("",3), (Address,1), (milliseconds,1), (sending,1), (nohup,2), (autopurge.,1), (WARN,1), ([myid:3],1), (feature,1), (already,1), (autopurge.purgeInterval=1,1), (#ZK提供了自动清理事务日志和快照文件的功能,这个参数指定了清理频率,单位是1时,1), (disable,1), (synchronization,1), (&,2), (syncLimit=5,1), (just,1), (connect,1), (increase,1), (clients,2), (read,1), (Set,1), (The,4), (tickTime=2000,1), (server.3=lew3:2181:3888,1), (interval,1), (number,5), (on,1)

1.3 Installing Livy

1 Extract the archive and set the $LIVY_HOME environment variable

[root@gcw2 src]# unzip livy-0.5.0-incubating-bin.zip
[root@gcw2 src]# mv livy-0.5.0-incubating-bin ../app/
[root@gcw2 app]# cd /opt/app
[root@gcw2 app]# ln -s livy-0.5.0-incubating-bin/ livy
# Set environment variables
vi /etc/profile
export LIVY_HOME=/opt/app/livy
export PATH=$PATH:$LIVY_HOME/bin
source /etc/profile

2 Edit conf/livy.conf

[root@gcw2 app]# cd   $LIVY_HOME/conf
[root@gcw2 conf]# cp livy.conf.template livy.conf
[root@gcw2 conf]# vi livy.conf
livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deployMode = cluster
livy.repl.enable-hive-context = true

3 Edit livy-env.sh

[root@gcw2 conf]# cp livy-env.sh.template livy-env.sh
[root@gcw2 conf]# vi livy-env.sh
export SPARK_HOME=/opt/app/spark
export HADOOP_HOME=/opt/app/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

4 Start the service

cd /opt/app/livy
mkdir logs
nohup /opt/app/livy/bin/livy-server &
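
Livy listens on port 8998 by default. Once the server is up, a quick check against its REST API (host and port here are the defaults and an assumption if you changed them):

# An empty batch list confirms the server is responding
curl http://127.0.0.1:8998/batches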

1.4 Installing Elasticsearch 5.6

1 Extract the archive

[root@gcw2 src]# tar -zxvf elasticsearch-5.6.0.tar.gz -C ../app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s elasticsearch-5.6.0/ es

2 Create the elasticsearch group and the elasticsearch user

# Create the group
groupadd elasticsearch
# Create the user
useradd elasticsearch -g elasticsearch
# Change ownership of the installation directory
[root@gcw2 app]# cd /opt/app/
[root@gcw2 app]# ll | grep el
drwxr-xr-x  7 root   root   131 Sep  7  2017 elasticsearch-5.6.0
lrwxrwxrwx  1 root   root    20 Sep 12 00:05 es -> elasticsearch-5.6.0/
[root@gcw2 app]# chown -R elasticsearch:elasticsearch elasticsearch-5.6.0/

3 Edit the Linux system file /etc/security/limits.conf

elasticsearch hard nofile 1000000
elasticsearch soft nofile 1000000
* soft nproc 4096
* hard nproc 4096

4 Edit the system file /etc/sysctl.conf

# Append at the end of the file:
vm.max_map_count=262144
# Run the following for the change to take effect:
sysctl -p

5 Edit the Elasticsearch configuration files

config/elasticsearch.yml

network.host: 0.0.0.0

config/jvm.options

-Xms1g
-Xmx1g

6 Start Elasticsearch

su elasticsearch
cd /opt/app/es
bin/elasticsearch -d

Verify by accessing port 9200 in a browser.
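
The same check from the command line (gcw2 is the Elasticsearch host used throughout this setup):

# Returns cluster name and version information if Elasticsearch is up
curl http://gcw2:9200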

7 Create the index

curl -XPUT http://gcw2:9200/griffin -d '
{
    "aliases":{

    },
    "mappings":{
        "accuracy":{
            "properties":{
                "name":{
                    "fields":{
                        "keyword":{
                            "ignore_above":256,
                            "type":"keyword"
                        }
                    },
                    "type":"text"
                },
                "tmst":{
                    "type":"date"
                }
            }
        }
    },
    "settings":{
        "index":{
            "number_of_replicas":"2",
            "number_of_shards":"5"
        }
    }
}
'

Response:

{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}
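
The new index can also be confirmed with the standard cat API:

curl "http://gcw2:9200/_cat/indices?v"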

The preparation work is now complete.

1.5 Griffin Compilation and Installation

1 Extract and create a symlink

[root@gcw2 src]# cd /opt/src/
[root@gcw2 src]# unzip griffin-griffin-0.5.0.zip 
[root@gcw2 src]# mv griffin-griffin-0.5.0 /opt/app/
[root@gcw2 src]# cd /opt/app/
[root@gcw2 app]# ln -s griffin-griffin-0.5.0/ griffin

2 Create the quartz database in MySQL and initialize it

Note: the script needs a small modification, mainly adding use quartz; so its statements run against the new database.

/opt/app/griffin/service/src/main/resources/Init_quartz_mysql_innodb.sql

Run the script above:

# In MySQL, create the database
create database quartz;
# Edit the script and add use quartz;
vi Init_quartz_mysql_innodb.sql
# Run from the command line to create the tables
mysql -uhive -p12345678 < Init_quartz_mysql_innodb.sql
# The quartz database now contains the following tables
QRTZ_BLOB_TRIGGERS
QRTZ_CALENDARS
QRTZ_CRON_TRIGGERS
QRTZ_FIRED_TRIGGERS
QRTZ_JOB_DETAILS
QRTZ_LOCKS
QRTZ_PAUSED_TRIGGER_GRPS
QRTZ_SCHEDULER_STATE
QRTZ_SIMPLE_TRIGGERS
QRTZ_SIMPROP_TRIGGERS
QRTZ_TRIGGERS
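
As a sanity check, the tables can be listed from the command line with the same hive credentials as above:

mysql -uhive -p12345678 -e "USE quartz; SHOW TABLES;"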

3 Hadoop and Hive

Create the /spark/spark_conf directory on HDFS and upload Hive's configuration file hive-site.xml to it:

hdfs dfs -mkdir -p /spark/spark_conf
hdfs dfs -put $HIVE_HOME/conf/hive-site.xml /spark/spark_conf/

That is, upload the hive-site.xml from the node where Griffin is installed to the corresponding HDFS directory.
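
A quick check that the file is in place (this is the path referenced later by sparkProperties.json):

hdfs dfs -ls /spark/spark_conf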

4 Make sure the following environment variables are set (/etc/profile)

export JAVA_HOME=/opt/app/jdk
export SPARK_HOME=/opt/app/spark
export LIVY_HOME=/opt/app/livy
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
[root@gcw2 resources]# echo $JAVA_HOME $HADOOP_CONF_DIR $SPARK_HOME  $LIVY_HOME
/opt/app/jdk /opt/app/hadoop/etc/hadoop /opt/app/spark /opt/app/livy

1.6 Compiling Griffin

1 Configure service/pom.xml

Edit service/pom.xml (around lines 113-117) and enable the MySQL JDBC dependency by removing the comment markers around it:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>${mysql.java.version}</version>
</dependency>

2 Edit the configuration file service/src/main/resources/application.properties

server.port = 9876

spring.application.name=griffin_service
spring.datasource.url=jdbc:mysql://gcw3:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=root
spring.datasource.password=123456
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true

# Hive metastore
hive.metastore.uris=thrift://gcw1:9083
hive.metastore.dbname=hivemetadata
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms

# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000

# Kafka schema registry
kafka.schema.registry.url=http://localhost:8081

# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000

# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000

# schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12

# external properties directory location
external.config.location=

# external BATCH or STREAMING env
external.env.location=

# login strategy ("default" or "ldap")
login.strategy=default

# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})

# hdfs default name
fs.defaultFS=
# elasticsearch
elasticsearch.host=gcw2
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://localhost:8998/batches
livy.need.queue=false
livy.task.max.concurrent.count=20
livy.task.submit.interval.second=3
livy.task.appId.retry.count=3
# yarn url
yarn.uri=http://gcw3:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook

The default port is 8080; it is changed to 9876 here to avoid conflicting with Spark. The Hive metastore service must be running. If Griffin and MySQL are not installed on the same node, make sure the database user has permission to log in remotely.
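
If remote access still needs to be granted, something along these lines works on MySQL 5.6/5.7; this is only a sketch using the root/123456 credentials from application.properties, so adjust the user, host pattern, and password to your own environment:

# Hypothetical example; run locally on the MySQL host (gcw3 in this setup)
mysql -uroot -p123456 -e "GRANT ALL PRIVILEGES ON quartz.* TO 'root'@'%' IDENTIFIED BY '123456'; FLUSH PRIVILEGES;"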

3 Edit the configuration file service/src/main/resources/quartz.properties

# Change line 26 to the following:
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate

4 Edit the configuration file service/src/main/resources/sparkProperties.json

{
	"file": "hdfs:///griffin/griffin-measure.jar",
	"className": "org.apache.griffin.measure.Application",
	"name": "griffin",
	"queue": "default",
	"numExecutors": 2,
	"executorCores": 1,
	"driverMemory": "1g",
	"executorMemory": "1g",
	"conf": {
		"spark.yarn.dist.files": "hdfs:///spark/spark_conf/hive-site.xml"
	},
	"files": []
}

5 Edit the configuration file service/src/main/resources/env/env_batch.json

Change line 24:

"api": "http://gcw2:9200/griffin/accuracy",

6 Build

cd /opt/app/griffin
mvn -Dmaven.test.skip=true clean install

The build downloads roughly 500 MB of jars, so point Maven's repository mirror at Aliyun. If any of the configuration files above are changed later, the project must be rebuilt.
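
A minimal mirror entry for ~/.m2/settings.xml, assuming the public Aliyun repository (place it inside the <mirrors> element):

<mirror>
    <id>aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Aliyun Public Repository</name>
    <url>https://maven.aliyun.com/repository/public</url>
</mirror>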

7 Build error

Cannot find name 'SVGElementTagNameMap'. [ERROR] ERROR

This file does not exist until the build has run (it is downloaded into node_modules during the UI build):

/opt/app/griffin/ui/angular/node_modules/@types/jquery/JQuery.d.ts

Delete the declarations referencing SVGElementTagNameMap (around line 4137):

find<K extends keyof SVGElementTagNameMap>(selector_element: K| JQuery<K>): JQuery<SVGElementTagNameMap[K]>;

    parents<K extends keyof SVGElementTagNameMap>(selector: K | JQuery<K>): JQuery<SVGElementTagNameMap[K]>;

8 Rebuild

cd /opt/app/griffin
mvn -Dmaven.test.skip=true clean install

Success:

[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ measure ---
[INFO] Installing /opt/app/griffin-griffin-0.5.0/measure/target/measure-0.5.0.jar to /root/.m2/repository/org/apache/griffin/measure/0.5.0/measure-0.5.0.jar
[INFO] Installing /opt/app/griffin-griffin-0.5.0/measure/pom.xml to /root/.m2/repository/org/apache/griffin/measure/0.5.0/measure-0.5.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Griffin 0.5.0 0.5.0:
[INFO] 
[INFO] Apache Griffin 0.5.0 ............................... SUCCESS [  0.919 s]
[INFO] Apache Griffin :: UI :: Default UI ................. SUCCESS [ 35.044 s]
[INFO] Apache Griffin :: Web Service ...................... SUCCESS [07:03 min]
[INFO] Apache Griffin :: Measures ......................... SUCCESS [02:58 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  10:38 min
[INFO] Finished at: 2021-09-13T21:59:47+08:00
[INFO] ------------------------------------------------------------------------

9 Copy the jars

# Copy service-0.5.0.jar to /opt/app/griffin
cd /opt/app/griffin/service/target
cp service-0.5.0.jar /opt/app/griffin
# Copy measure-0.5.0.jar to /opt/app/griffin and rename it
cd /opt/app/griffin/measure/target
cp measure-0.5.0.jar /opt/app/griffin/griffin-measure.jar
# Upload griffin-measure.jar to hdfs:///griffin
cd /opt/app/griffin
hdfs dfs -mkdir /griffin
hdfs dfs -put griffin-measure.jar /griffin
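
A quick check that the jar is on HDFS where sparkProperties.json expects it:

hdfs dfs -ls /griffin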

Note: when Spark runs jobs on the YARN cluster it loads griffin-measure.jar from the /griffin directory on HDFS, which avoids a ClassNotFoundException for org.apache.griffin.measure.Application.

1.7 Starting the Griffin Service

cd /opt/app/griffin
nohup java -jar service-0.5.0.jar>service.out 2>&1 &

Username / password: admin / admin
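
To confirm the service came up, tail the log and probe the port; the host and port below are assumptions taken from the application.properties settings above:

tail -n 50 /opt/app/griffin/service.out
# Print only the HTTP status code; 200 means the UI is being served
curl -s -o /dev/null -w "%{http_code}\n" http://gcw2:9876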


1.8 Starting and Stopping Related Services

Start

## Start Livy on gcw2
nohup /opt/app/livy/bin/livy-server &
## Start Elasticsearch on gcw2
su elasticsearch
cd /opt/app/es
bin/elasticsearch -d
exit

## Start the Hive metastore on gcw1
[root@gcw1 ~]# nohup hive --service metastore &

Stop

## Find the PID with jps and stop it with kill -9; Elasticsearch as an example
[elasticsearch@gcw2 es]$ jps
7008 Jps
6705 Elasticsearch
[elasticsearch@gcw2 es]$ kill -9 6705

2 Data Quality

2.1 Data Assets


2.2 Data Measures: Creating a Measure

  • To measure the match rate between a source and a target, choose Accuracy.
  • To check specific values in the data (for example, counts of null columns), choose Data Profiling, which can:
    • Count null, unique, or duplicate values in specific columns of a table
    • Compute statistics such as maximum, minimum, mean, and median
    • Analyze data frequency and patterns with regular expressions

