Airflow 2.1.1 Detailed Installation Guide
MySQL Installation

For installing MySQL, see my earlier post: "Installing MySQL 5.7 on Linux and Problems Encountered".

After MySQL is installed, create the airflow database and user, and grant the required privileges:

CREATE DATABASE airflow CHARACTER SET utf8;
CREATE USER 'airflow'@'%' IDENTIFIED BY 'yourpassword';
GRANT ALL PRIVILEGES ON *.* TO 'airflow'@'%' IDENTIFIED BY 'yourpassword' WITH GRANT OPTION;
SET GLOBAL explicit_defaults_for_timestamp = 1;
FLUSH PRIVILEGES;
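The credentials created above are exactly what Airflow's `sql_alchemy_conn` setting will reference later in `airflow.cfg`. A minimal sketch (pure Python; `hostname` and `yourpassword` are placeholders, substitute your own values) of how the connection string is assembled:

```python
# Assemble the SQLAlchemy connection string that airflow.cfg will use later.
# "hostname" and "yourpassword" are placeholders for your real host/password.
def build_sql_alchemy_conn(user, password, host, port, db):
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{db}"

conn = build_sql_alchemy_conn("airflow", "yourpassword", "hostname", 3306, "airflow")
print(conn)  # mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
```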
Install Python 3.7.5 (important)

This part must be done on every node where Airflow will be installed.

The official Airflow documentation assumes Python 3, but CentOS 7 ships with Python 2 by default, and installing Airflow on top of it runs into all sorts of problems.
Install build tools and dependencies

yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
yum -y install libffi-devel
Download and compile Python 3.7

wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tar.xz
tar -xvJf Python-3.7.5.tar.xz
mkdir /usr/python3.7
cd Python-3.7.5
./configure --prefix=/usr/python3.7
make && make install
Create symlinks

ln -s /usr/python3.7/bin/python3 /usr/bin/python3.7
ln -s /usr/python3.7/bin/pip3 /usr/bin/pip3.7
Verify the installation

python3.7 -V
pip3.7 -V

If both commands print the expected versions, the setup succeeded.
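Beyond eyeballing the interpreter version, a quick programmatic sanity check (a sketch, not part of the original guide) can confirm the interpreter meets Airflow 2.x's Python 3.6+ requirement:

```python
import sys

# Airflow 2.x requires Python 3.6 or newer; this guide installs 3.7.5.
def python_ok(major, minor):
    return (major, minor) >= (3, 6)

print(python_ok(sys.version_info.major, sys.version_info.minor))
```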
Because yum itself requires Python 2, its scripts must be pointed back at it:

vim /usr/bin/yum
# change the shebang  #! /usr/bin/python  to  #! /usr/bin/python2

vim /usr/libexec/urlgrabber-ext-down
# likewise change  #! /usr/bin/python  to  #! /usr/bin/python2
Install required packages (important)

# An outdated pip can make the Airflow install fail
pip3.7 install --upgrade pip
sudo pip3.7 install pymysql
sudo pip3.7 install celery
sudo pip3.7 install flower
sudo pip3.7 install psycopg2-binary
2. Install Airflow (important)

Note: steps 2.1, 2.2 and 2.3 must be performed on every installation node.
2.1 Grant the airflow user sudo privileges

Airflow will run as a dedicated airflow user, so create it and grant passwordless sudo:

# Run the following as root
useradd airflow
vi /etc/sudoers
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
airflow ALL=(ALL)       NOPASSWD: ALL    # add this line
2.2 Set the Airflow environment variables

After installation, Airflow's default install path is /home/airflow/.local/bin.

# Run as root
vi /etc/profile
export PATH=$PATH:/usr/python3.7/bin:/home/airflow/.local/bin
source /etc/profile

Here /home/airflow/.local/bin is simply ~/.local/bin for the airflow user; adjust PATH=$PATH:~/.local/bin to match your actual setup.

# Set AIRFLOW_HOME as the airflow user (optional; defaults to ~/airflow)
export AIRFLOW_HOME=~/airflow
2.3 Install Airflow

# Switch from root to the airflow user
su airflow
# Run the following as the airflow user
AIRFLOW_VERSION=2.1.1
PYTHON_VERSION="$(python3.7 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-no-providers-${PYTHON_VERSION}.txt"
# sudo is needed here, otherwise some packages end up silently missing.
# Be sure to include the mysql, celery and cncf.kubernetes extras,
# or Airflow will fail to start later.
sudo pip3.7 install "apache-airflow[mysql,celery,cncf.kubernetes]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" -i https://pypi.rasa.com/simple --use-deprecated=legacy-resolver
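The shell above derives PYTHON_VERSION and CONSTRAINT_URL with cut; the same derivation in Python (an illustrative sketch) makes the string handling explicit:

```python
# Mirror of the shell logic: derive the constraints URL from the Airflow
# version and the major.minor part of `python3.7 --version` output.
def constraint_url(airflow_version, python_version_output):
    full = python_version_output.split(" ")[1]      # "3.7.5"
    major_minor = ".".join(full.split(".")[:2])     # "3.7"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-no-providers-{major_minor}.txt"
    )

print(constraint_url("2.1.1", "Python 3.7.5"))
```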
If the installation succeeded, the airflow command is now available, and the Airflow install directory contains the following files:

airflow.cfg
webserver_config.py
2.4 Configure Airflow

The high-availability Airflow architecture is shown below:

Edit the {AIRFLOW_HOME}/airflow.cfg file:

# Add or change the following in {AIRFLOW_HOME}/airflow.cfg
# 1. Executor
# executor = LocalExecutor
executor = CeleryExecutor
# 2. Metadata database (metastore)
# sql_alchemy_conn = sqlite:home/apps/airflow/airflow.db
sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
# 3. Message queue broker; RabbitMQ is used here
# broker_url = redis://redis:6379/0
broker_url = amqp://admin:yourpassword@hostname:5672/
# 4. Result backend
# result_backend = db+postgresql://postgres:airflow@postgres/airflow
result_backend = db+mysql://airflow:yourpassword@hostname:3306/airflow
# 5. Time zone
# default_timezone = utc
default_timezone = Asia/Shanghai
default_ui_timezone = Asia/Shanghai
# 6. Web port (default 8080; changed to 8081 here because ambari occupies 8080)
endpoint_url = http://localhost:8081
base_url = http://localhost:8081
web_server_port = 8081
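Since airflow.cfg is standard INI, the stdlib configparser can sanity-check the edits before the file is synced to other nodes. A sketch (the section names core/celery/webserver match Airflow 2.1's layout; in practice read {AIRFLOW_HOME}/airflow.cfg instead of the inline sample):

```python
import configparser

# Inline fragment mirroring the edits above, for illustration only.
sample = """
[core]
executor = CeleryExecutor
sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
default_timezone = Asia/Shanghai

[celery]
broker_url = amqp://admin:yourpassword@hostname:5672/
result_backend = db+mysql://airflow:yourpassword@hostname:3306/airflow

[webserver]
web_server_port = 8081
"""

cfg = configparser.ConfigParser()
cfg.read_string(sample)
print(cfg["core"]["executor"])              # CeleryExecutor
print(cfg["webserver"]["web_server_port"])  # 8081
```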
The modified {AIRFLOW_HOME}/airflow.cfg must be synced to every server where Airflow is installed. Also create the directories referenced by the dags_folder and base_log_folder settings, to avoid errors when DAGs run later.
2.5 Start the Airflow cluster

Initialize the database:

airflow db init

If the tables shown below appear in the mysql airflow database, initialization succeeded.
Create a user:

airflow users create \
    --username admin \
    --firstname Lixiaolong \
    --lastname Bigdata \
    --role Admin \
    --email spiderman@superhero.org

Set the password when prompted on the console. Here the password is set to: yourpassword
Start the webserver:

airflow webserver -D

Start the scheduler:

nohup airflow scheduler &

Start the workers:

# Start flower first, on each server that will run a worker
airflow celery flower -D
# Then start the worker, on each server that will run a worker
airflow celery worker -D
2.6 Log in to the web UI

Web UI: http://master1:8081/
Account: admin
Password: the one set in step 2.5

The UI looks like this:

Worker information can be viewed at http://hostip:5555, as shown below:
2.7 Configuring jobs in Airflow

Airflow ships with 32 example DAGs to play with; selecting a DAG in the web UI and clicking its toggle switches it to the Active state.

Next, using the commonly used HiveOperator as an example, here is how to write and run a custom DAG.
Installing dependencies

To use HiveOperator, the Hive-related dependencies must be installed first.

If you hit an error like this while using it:

ModuleNotFoundError: No module named 'airflow.providers.apache'

install the Hive extra manually:

su airflow
pip3.7 install apache-airflow[hive]
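Whether the Hive provider is importable can be checked without starting Airflow. A sketch using only the stdlib (it returns False when either airflow itself or the provider package is missing):

```python
import importlib.util

def provider_installed(module="airflow.providers.apache.hive"):
    # find_spec raises ModuleNotFoundError when a parent package is absent,
    # and returns None when only the leaf module is missing.
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:
        return False

print(provider_installed())
```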
Writing a DAG

DAG directory: see the dags_folder option in airflow.cfg.

Place your Python file in that directory. As an example, the following DAG, named test_hive2, queries a Hive table every minute:

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from datetime import datetime, timedelta
from airflow.models import Variable
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': days_ago(1),
    'retries': 10,
    'retry_delay': timedelta(seconds=5),
}

dag = DAG('test_hive2',
          default_args=default_args,
          schedule_interval='*/1 * * * *',
          catchup=False)

t1 = HiveOperator(
    task_id='hive_task',
    hql='select * from test.data_demo',
    dag=dag)
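With retries=10 and a retry_delay of 5 seconds, a permanently failing task spends at most about 50 extra seconds waiting between attempts. A quick arithmetic check (a sketch that ignores the task's own runtime; Airflow also supports exponential backoff, which this DAG does not use):

```python
from datetime import timedelta

retries = 10
retry_delay = timedelta(seconds=5)

# Total wait spent between the initial attempt and the final retry.
total_delay = retries * retry_delay
print(total_delay)  # 0:00:50
```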
If the DAG file is well-formed, the new DAG will show up in the web UI after a refresh.
Configuring a Connection

As shown below, click Admin -> Connections in the UI to configure the connection. Hive uses the connection hive_cli_default by default.

Pay attention to the fields marked in the screenshot:

Conn Type: choose Hive Client Wrapper (the default once the Hive dependency is installed)
Host: the node where Hive is installed
Login: a user with permission to run Hive jobs

Save the connection when done.
Triggering and scheduling

The screenshots below show enabling the schedule and triggering a run manually.

After a task is triggered, click the middle part of the task bar to inspect its run details. For example, drilling into a task shows its run details and logs.
3. Problems Encountered
3.1 Python module download error

Collecting flask-appbuilder<2.0.0,>=1.12.2; python_version < "3.6"
  Using cached Flask-AppBuilder-1.13.1.tar.gz (1.5 MB)
    ERROR: Command errored out with exit status 1:
     command: /bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"'; __file__='"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-StYjJL
         cwd: /tmp/pip-install-EFxJZq/flask-appbuilder/
    Complete output (3 lines):
    /usr/lib64/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'long_description_content_type'
      warnings.warn(msg)
    error in Flask-AppBuilder setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers
Solution: upgrade setuptools to the latest version:

pip install setuptools -U
3.2 Airflow commands fail with: error: sqlite C library version too old (< {min_sqlite_version})

The full traceback:

Traceback (most recent call last):
  File "/usr/python3.7/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/__init__.py", line 34, in <module>
    from airflow import settings
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/settings.py", line 35, in <module>
    from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf  # NOQA F401
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 1114, in <module>
    conf.validate()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 202, in validate
    self._validate_config_dependencies()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 243, in _validate_config_dependencies
    f"error: sqlite C library version too old (< {min_sqlite_version}). "
airflow.exceptions.AirflowConfigException: error: sqlite C library version too old (< 3.15.0). See https://airflow.apache.org/docs/apache-airflow/2.1.1/howto/set-up-database.rst#setting-up-a-sqlite-database

Cause: Airflow defaults to sqlite as its metastore, but we use mysql, so sqlite is never actually needed.

Solution: in {AIRFLOW_HOME}/airflow.cfg, change the metadata connection sql_alchemy_conn to:

sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
3.3 airflow db init fails with: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

  File "/usr/python3.7/lib/python3.7/site-packages/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py", line 44, in upgrade
    raise Exception("Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql")
Exception: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

Solution: connect to the mysql airflow database and enable the global explicit_defaults_for_timestamp variable:

SHOW GLOBAL VARIABLES LIKE '%timestamp%';
SET GLOBAL explicit_defaults_for_timestamp = 1;

Before the change:
After the change: