Airflow2.1.1超详细安装文档

Mysql

安装

MySQL安装可以参考我之前写过的博客:linux下安装MySQL5.7及遇到的问题总结

MySQL安装完成后,需要创建airflow数据库,用户,并赋予相关权限

CREATE DATABASE airflow CHARACTER SET utf8;
CREATE USER 'airflow'@'%' IDENTIFIED BY 'yourpassword';
GRANT ALL PRIVILEGES ON *.* TO 'airflow'@'%' IDENTIFIED BY 'yourpassword' WITH GRANT OPTION;
set global explicit_defaults_for_timestamp =1;
FLUSH PRIVILEGES;
安装python3.7.5(重要)

该部分需要在所有airflow安装节点进行操作

Airflow官方文档中,给出的安装方式是Python3,CentOS7机器上是默认是python2,安装airflow过程中会出现各种各样的问题.

安装编译相关工具

yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
yum install libffi-devel -y

下载编译Python3.7

wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tar.xz
tar -xvJf Python-3.7.5.tar.xz
mkdir /usr/python3.7
cd Python-3.7.5
./configure --prefix=/usr/python3.7
make && make install

创建软链接

ln -s /usr/python3.7/bin/python3 /usr/bin/python3.7
ln -s /usr/python3.7/bin/pip3 /usr/bin/pip3.7

验证是否安装成功

python3.7 -V
pip3.7 -V

如下所示,证明配置成功:

  

 因为执行yum需要python2版本,所以我们还要修改yum的配置

vim /usr/bin/yum

#! /usr/bin/python修改为#! /usr/bin/python2
vim /usr/libexec/urlgrabber-ext-down
#! /usr/bin/python 也要修改为#! /usr/bin/python2

确保安装必要软件(重要)

# 安装airflow pip版本过低会导致安装失败
pip3.7 install --upgrade pip
sudo pip3.7 install pymysql
sudo pip3.7 install celery
sudo pip3.7 install flower
sudo pip3.7 install psycopg2-binary

二、安装Airflow(重要)

注意: 2.1,2.2,2.3部分需要在所有安装节点进行操作

2.1 配置 airflow sudo权限

这里使用airflow用户进行

配置airflow用户 sudo权限

# 以下命令使用root用户
useradd airflow
vi /etc/sudoers

## Allow root to run any commands anywhere 
rootALL=(ALL) ALL
airflow    ALL=(ALL)                NOPASSWD: ALL #加入这一行

2.2 设置Airflow环境变量

安装完后airflow安装路径默认为: /home/airflow/.local/bin

#使用root用户执行
vi /etc/profile
export PATH=$PATH:/usr/python3.7/bin:/home/airflow/.local/bin
source /etc/profile

此处的/home/airflow/.local/bin 为~/.local/bin
根据实际配置PATH=$PATH:~/.local/bin

#配置环境变量,使用airflow用户执行(可选,默认为~/airflow)
export AIRFLOW_HOME=~/airflow

2.3 安装airflow

su airflow #root用户
# 以下命令使用airflow用户
AIRFLOW_VERSION=2.1.1
PYTHON_VERSION="$(python3.7 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-no-providers-${PYTHON_VERSION}.txt"
# 这里要加sudo,否则会存在部分缺失,并且没有报错,这里要注意添加mysql,celery,cncf.kubernetes依赖,否则后续启动airflow时会报错
sudo pip3.7 install "apache-airflow[mysql,celery,cncf.kubernetes]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" -i https://pypi.rasa.com/simple --use-deprecated=legacy-resolver

如果airflow安装正常,此时将能够使用airflow命令,并且airflow安装目录下有如下文件:

airflow.cfg
webserver_config.py

2.4 配置ariflow

airflow高可用架构如下:

修改{AIRFLOW_HOME}/airflow.cfg文件

# 在{AIRFLOW_HOME}/airflow.cfg 添加或者修改如下配置

# 1. 修改Executor配置
# executor = LocalExecutor
executor = CeleryExecutor
# 2. 修改元数据库(metestore)配置
#sql_alchemy_conn = sqlite:home/apps/airflow/airflow.db
sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow

# 3.设置消息队列broker,此处使用 RabbitMQ
# broker_url = redis://redis:6379/0
broker_url = amqp://admin:yourpassword@hostname:5672/
# 4.设定结果存储后端backend
# result_backend = db+postgresql://postgres:airflow@postgres/airflow
result_backend = db+mysql://airflow:yourpassword@hostname:3306/airflow

# 5. 修改时区
# default_timezone = utc
default_timezone = Asia/Shanghai
default_ui_timezone = Asia/Shanghai

# 6. 配置web端口(默认8080,因为被ambari占用所以改为8081)
endpoint_url = http://localhost:8081
base_url = http://localhost:8081
web_server_port = 8081

修改后的{AIRFLOW_HOME}/airflow.cfg需要同步到所有安装airflow的服务器上

同时,需要根据dags_folder,base_log_folder配置创建相关目录,防止后面执行dag时报错

2.5 启动airflow集群

初始化数据库

airflow db init

mysql中出现如下表结构证明初始化成功

 创建用户:

airflow users create \
    --username admin \
    --firstname Lixiaolong \
    --lastname Bigdata \
    --role Admin \
    --email spiderman@superhero.org

根据控制台输出设置Password

Password设置为:yourpassword

启动webserver:

airflow webserver -D

启动scheduler

nohup airflow scheduler &;

启动worker

# 先启动flower,在需要启动worker服务器执行
airflow celery flower -D
# 启动worker,在需要启动worker服务器执行
airflow celery worker -D

2.6 登录webui查看

webui: http://master1:8081/
账号: admin
密码: 2.5阶段设置的密码

界面显示如下图:

 worker的信息可以通过http://hostip:5555进行查看,如下图:

2.7 使用Airflow配置作业

 Airflow默认配置了32Dag供大家食用,webui选中Dag点击一下,即可变成Active状态

接下来,以常用的Hive Operator举例,如何编写并执行自定义Dag

依赖安装

使用Hive Operator,需要首先安装Hive相关依赖

如果使用中遇到类似如下的问题:

ModuleNotFoundError: No module named 'airflow.providers.apache'

就需要手动安装hive依赖,命令如下

su airflow
pip3.7 install airflow[hive]

Dag编写

Dag目录: 见airflow.cfg配置项dags_folder

将写好的python文件放置该目录下,举例:
该示例为定时 每隔一分钟查询hive表中数据,Dag名称为test_hive2

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from datetime import datetime, timedelta
from airflow.models import Variable
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': days_ago(1),

    'retries': 10,
    'retry_delay': timedelta(seconds=5),

}

dag = DAG('test_hive2', default_args=default_args, schedule_interval='*/1 * * * *',  catchup=False)

t1 = HiveOperator(
    task_id='hive_task',
    hql='select * from test.data_demo',
    dag=dag)
t1

如果Dag格式正确,将会在webui上刷新出新添加的dag信息

 配置Connection

如下图所示,界面点击admin–>Connections配置connection

 

 hive默认使用的connectionshive_cli_default

 

 需要注意下图中标记出来的几个配置项:
Conn Type选择Hive Client Wrapper(如果安装了hive依赖,默认就是这个)
Host设置为安装了Hive的节点
Login需要设置为一个有权限执行hive任务的用户

配置完成,保存即可

任务调度

如下图所示,为开启调度,和手动触发

 任务触发后,点击任务栏中间的部分,可以查看任务运行细节,

 举例,点进去一个任务之后,我们可以看到它的运行细节运行日志

三. 遇到的问题

3.1 python模块下载报错

Collecting flask-appbuilder<2.0.0,>=1.12.2; python_version < "3.6"
  Using cached Flask-AppBuilder-1.13.1.tar.gz (1.5 MB)
    ERROR: Command errored out with exit status 1:
     command: /bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"'; __file__='"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-StYjJL
         cwd: /tmp/pip-install-EFxJZq/flask-appbuilder/
    Complete output (3 lines):
    /usr/lib64/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'long_description_content_type'
      warnings.warn(msg)
    error in Flask-AppBuilder setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers

解决方案:
setuptools升级到最新版即可

pip install setuptools -U

3.2 执行ariflow相关命令报错 error: sqlite C library version too old (< {min_sqlite_version}). 

详细报错如下:

Traceback (most recent call last):
  File "/usr/python3.7/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/__init__.py", line 34, in <module>
    from airflow import settings
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/settings.py", line 35, in <module>
    from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf  # NOQA F401
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 1114, in <module>
    conf.validate()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 202, in validate
    self._validate_config_dependencies()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 243, in _validate_config_dependencies
    f"error: sqlite C library version too old (< {min_sqlite_version}). "
airflow.exceptions.AirflowConfigException: error: sqlite C library version too old (< 3.15.0). See https://airflow.apache.org/docs/apache-airflow/2.1.1/howto/set-up-database.rst#setting-up-a-sqlite-database

原因: airflow默认使用sqlite作为metastore,但我们使用的是mysql,实际上用不到sqlite

解决方案:修改{AIRFLOW_HOME}/airflow.cfg,
将元数据库信息sql_alchemy_conn修改为

sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow`

3.3 执行airflow db init失败 Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

  File "/usr/python3.7/lib/python3.7/site-packages/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py", line 44, in upgrade
    raise Exception("Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql")
Exception: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql

解决方法:
进入mysql airflow 数据库,设置global explicit_defaults_for_timestamp

SHOW GLOBAL VARIABLES LIKE '%timestamp%';
SET GLOBAL explicit_defaults_for_timestamp =1;

设置前:

 

 设置后:

 
posted @ 2021-09-13 11:08  vekair  阅读(1421)  评论(0编辑  收藏  举报