Setting Up and Using DataHub, an Open-Source Metadata Management Tool
I. Installing DataHub
1. Install docker and docker-compose
yum -y install docker
curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
Verify the installation:
docker --version
docker-compose --version
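The same check can be scripted. The sketch below is only an illustration and assumes a Python 3 interpreter is already available (see step 3); the helper name missing_tools is hypothetical and not part of DataHub or docker.

```python
import shutil

# Executables installed in this step.
REQUIRED_TOOLS = ["docker", "docker-compose"]


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools are on PATH")
```

This only confirms that the executables are on PATH; it does not check that the docker daemon is running.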
2. Install jq
yum -y install epel-release
yum -y install jq
3. Install python3
yum -y install python-pip gcc gcc-c++ python-virtualenv cyrus-sasl-devel
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel
wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tgz
tar -zxvf Python-3.7.3.tgz
mkdir /usr/local/python3
cd Python-3.7.3
./configure --prefix=/usr/local/python3
make && make install
Point the system python command at python3:
rm -rf /usr/bin/python
ln -s /usr/local/python3/bin/python3 /usr/bin/python
Point the pip command at pip3:
rm -rf /usr/bin/pip
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
After switching the default python to python3, update the yum scripts, which still require python2:
vi /usr/bin/yum => change #! /usr/bin/python to #! /usr/bin/python2
vi /usr/libexec/urlgrabber-ext-down => change #! /usr/bin/python to #! /usr/bin/python2
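If you prefer not to edit the two files by hand, the shebang change can be expressed as a small text transformation. This is a minimal sketch, not a tested migration script: the function name pin_python2 is hypothetical, and it only handles a first line that is exactly the unversioned interpreter path (with or without a space after "#!", and without interpreter options).

```python
import re

# A first-line shebang pointing at the unversioned interpreter,
# with or without whitespace after "#!", and nothing else on the line.
_SHEBANG = re.compile(r"\A#![ \t]*/usr/bin/python[ \t]*(?=\r?\n|\Z)")


def pin_python2(script_text):
    """Return script_text with its shebang pinned to /usr/bin/python2.

    Text whose first line is anything else is returned unchanged.
    """
    return _SHEBANG.sub("#! /usr/bin/python2", script_text, count=1)
```

To apply it, read each file as text, pass the contents through pin_python2, and write the result back as root, after taking a backup copy of the original file.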
Upgrade pip:
python -m pip install --upgrade pip wheel setuptools
4. Install and start DataHub
python -m pip uninstall datahub acryl-datahub || true
python -m pip install --upgrade acryl-datahub
python -m datahub version
python -m datahub docker quickstart
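Once quickstart finishes, a quick way to confirm that the services are listening is a plain TCP check. The sketch below assumes the services run on the local machine; port 8080 is the REST server address used by the recipes in this guide, while 9002 for the web UI is an assumption based on the quickstart defaults and may differ in your deployment. The helper port_open is hypothetical.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8080: REST server used by the recipes in this guide.
    # 9002: assumed web UI port of the quickstart deployment.
    for name, port in (("REST server", 8080), ("web UI", 9002)):
        state = "up" if port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening; it does not prove that the service behind it is healthy.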
II. Hands-on practice
1. Import MySQL metadata (a new MySQL container is created with docker for this example)
docker run -p 13306:3306 --name ownmysql -v /opt/docker_data/mysql/conf:/etc/mysql/conf.d -v /opt/docker_data/mysql/logs:/logs -v /opt/docker_data/mysql/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=123456 -d mysql
Install the MySQL plugin:
pip install 'acryl-datahub[mysql]'
Check which plugins are installed:
python -m datahub check plugins
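2. Write a YAML recipe file (mysql_to_datahub_rest.yml) that reads the MySQL metadata and sends it to DataHub through the REST interface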
source:
  type: mysql
  config:
    host_port: node:13306
    username: root
    password: "123456"
    database: aucc
sink:
  type: "datahub-rest"
  config:
    server: "http://node:8080"
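To keep the password out of the recipe file, the recipe can reference an environment variable, which the DataHub CLI expands when it loads the file. The sketch below generates such a recipe as text. The function name render_mysql_recipe and the variable name MYSQL_PASSWORD are illustrative choices, not names required by DataHub.

```python
# Recipe text with placeholders; "${{...}}" renders as a literal "${...}",
# which is left for the DataHub CLI to expand from the environment.
RECIPE_TEMPLATE = """\
source:
  type: mysql
  config:
    host_port: "{host_port}"
    username: "{username}"
    password: "${{MYSQL_PASSWORD}}"
    database: "{database}"
sink:
  type: "datahub-rest"
  config:
    server: "{server}"
"""


def render_mysql_recipe(host_port, username, database, server):
    """Return the recipe text for a MySQL source and a datahub-rest sink."""
    return RECIPE_TEMPLATE.format(
        host_port=host_port, username=username, database=database, server=server
    )


if __name__ == "__main__":
    # Values taken from the recipe in this guide.
    print(render_mysql_recipe("node:13306", "root", "aucc", "http://node:8080"), end="")
```

Redirect the output to mysql_to_datahub_rest.yml and export MYSQL_PASSWORD in the shell before running the ingestion.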
3. Run the ingestion
python -m datahub ingest -c mysql_to_datahub_rest.yml
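When the ingestion is triggered from a script or scheduler, the same command can be wrapped in Python. This is a rough sketch that assumes the CLI signals failure through a non-zero exit status; ingest_command and run_ingest are hypothetical helper names.

```python
import subprocess
import sys


def ingest_command(recipe_path):
    """Build the argument list for `python -m datahub ingest -c <recipe>`."""
    # sys.executable is the interpreter running this script, so the datahub
    # package must be installed for that interpreter.
    return [sys.executable, "-m", "datahub", "ingest", "-c", recipe_path]


def run_ingest(recipe_path):
    """Run the ingestion and return True if the CLI exits with status 0."""
    result = subprocess.run(ingest_command(recipe_path))
    return result.returncode == 0
```

For example, run_ingest("mysql_to_datahub_rest.yml") runs the recipe from step 2 and returns a boolean that a calling script can act on.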
4. Ingest Hive metadata
Install the prerequisites:
yum -y install cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi
pip install 'acryl-datahub[hive]'
Write the recipe file hive_to_datahub_rest.yml:
source:
  type: hive
  config:
    host_port: node:10000
    username:
    password:
    database: default
sink:
  type: "datahub-rest"
  config:
    server: "http://node:8080"
python -m datahub ingest -c hive_to_datahub_rest.yml
5. Web UI