Deploying and Using AirFlow in a Container
I. How to Build the AirFlow Container
1. Install the Docker environment

This deployment targets CentOS; CentOS 6 or CentOS 7 is recommended.

1.1 Download the Docker package
Download from: https://download.docker.com/linux/static/stable/x86_64/
Version 18.09.6 is recommended.

1.2 Extract the archive after downloading
tar -zxf docker-18.09.6.tgz

1.3 Move the extracted Docker binaries to /usr/bin/
cp docker/* /usr/bin/

1.4 Register Docker as a systemd service
Create the file:
vim /etc/systemd/system/docker.service
and add the following content:

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target

Set the file permissions and reload systemd:
chmod +x /etc/systemd/system/docker.service
systemctl daemon-reload

1.5 Start Docker
systemctl start docker

1.6 Verify the installation
systemctl status docker    # check Docker's status
docker -v                  # check the Docker version
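If the Docker SDK for Python happens to be available on the host (pip install docker; this is an assumption, not part of the steps above), the daemon can also be sanity-checked programmatically. A minimal sketch, equivalent in spirit to the two verification commands in 1.6:

# Sanity check for the Docker daemon, assuming the docker SDK for Python
# is installed (pip install docker). Purely an illustrative alternative
# to `systemctl status docker` / `docker -v`.
import docker

client = docker.from_env()          # connects via /var/run/docker.sock
print(client.version()["Version"])  # daemon version, e.g. 18.09.6
print(client.ping())                # True if the daemon is reachable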
2. Install AirFlow in the Docker environment

2.1 Clone the source into /root/airflow
git clone https://github.com/puckel/docker-airflow.git /root/airflow

2.2 Run the container
docker run --net=bridge --name AirFlow \
  -e MYSQL_IP_PORT="172.16.117.125:3306/airflow" \
  -e MYSQL_USERNAME="root" \
  -e MYSQL_PASSWORD="123456" \
  -v /usr/local/airflow/dags:/usr/local/airflow/dags \
  -v /usr/local/airflow/airflowSql:/usr/local/airflow/airflowSql \
  -v /usr/local/airflow/airflow.cfg:/usr/local/airflow/airflow.cfg \
  -id -p 8081:8080 --privileged=true puckel/docker-airflow

Explanation:
AirFlow: the container name
MYSQL_IP_PORT: the MySQL address in the form ip:port/database_name
MYSQL_USERNAME: the MySQL user name
MYSQL_PASSWORD: the MySQL password
-v /usr/local/airflow/dags:/usr/local/airflow/dags
    host directory for DAG files : container directory for DAG files
-v /usr/local/airflow/airflowSql:/usr/local/airflow/airflowSql
    host directory for executable scripts : container directory for executable scripts
-v /usr/local/airflow/airflow.cfg:/usr/local/airflow/airflow.cfg
    maps the airflow configuration file to the host
puckel/docker-airflow: the image name

2.3 Enter the container
docker exec -it -u root AirFlow bash
/* This drops you into the container's /usr/local/airflow directory (airflow's default install directory) */

2.4 Edit the configuration file
vim airflow.cfg

dags_folder = $AIRFLOW_HOME/dags          # directory where DAG files are stored
base_log_folder = $AIRFLOW_HOME/logs      # directory where run logs are stored
executor = LocalExecutor
sql_alchemy_conn = mysql://$MYSQL_USERNAME:$MYSQL_PASSWORD@$MYSQL_IP_PORT
load_examples = False
dags_are_paused_at_creation = False

2.5 Initialize the database
airflow initdb

If initialization fails with:
airflow.exceptions.AirflowException: Could not create Fernet object: Incorrect padding
Fix: generate a Fernet key, export it (use the key printed by the first command), and rerun the initialization:
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
export AIRFLOW__CORE__FERNET_KEY=oNu9XwewQNyx9mAJT2vZvtm3qzPRZIWRqwk9hSVch4A=
airflow initdb    // rerun the database initialization

2.6 Run the services in the background
Webserver:
nohup airflow webserver >> $AIRFLOW_HOME/airflow-webserver.log 2>&1 &
Scheduler:
nohup airflow scheduler >> $AIRFLOW_HOME/airflow-scheduler.log 2>&1 &

2.7 Open in a browser:
172.16.117.125:8081
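Before running airflow initdb it can help to confirm that the sql_alchemy_conn value actually reaches MySQL. A minimal sketch, assuming SQLAlchemy and a MySQL driver are importable inside the container (verify this for your image); the host, credentials and database name are the example values from step 2.2, not anything you should reuse verbatim:

# Connectivity check for the sql_alchemy_conn string, assuming SQLAlchemy
# and a MySQL driver are present in the container. Values mirror the
# example configuration above.
from sqlalchemy import create_engine, text

conn = "mysql://root:123456@172.16.117.125:3306/airflow"  # $MYSQL_USERNAME:$MYSQL_PASSWORD@$MYSQL_IP_PORT
engine = create_engine(conn)
with engine.connect() as connection:
    # prints the MySQL server version if the database is reachable
    print(connection.execute(text("SELECT VERSION()")).scalar())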
II. How to Migrate the Deployed AirFlow Container to Another Server
/* Before migrating the container, install a few common commands in it, since the target server may not have internet access */

1. Install common tools: vim, ping, ifconfig
apt-get update
apt-get install vim            // installs vim
apt-get install net-tools      // installs ifconfig
apt-get install iputils-ping   // installs ping

2. Commit the configured airflow container as an image
docker commit 0e3d77afccc3 airflow
/* docker commit <container ID> <image name> */

3. Save the image to a file
docker save -o airflow.tar airflow

4. Copy the file to the target server

5. Load the file into an image on the new server
docker load -i airflow.tar

6. Start a container from the newly imported image
docker run --net=bridge --name AirFlow --hostname airflow \
  -e MYSQL_IP_PORT="172.16.117.125:3306/airflow" \
  -e MYSQL_USERNAME="root" \
  -e MYSQL_PASSWORD="123456" \
  -v /usr/local/airflow/dags:/usr/local/airflow/dags \
  -v /usr/local/airflow/airflowSql:/usr/local/airflow/airflowSql \
  -v /usr/local/airflow/airflow.cfg:/usr/local/airflow/airflow.cfg \
  -id -p 8084:8080 --privileged=true airflow

Explanation:
AirFlow: the container name
MYSQL_IP_PORT: the MySQL address in the form ip:port/database_name
MYSQL_USERNAME: the MySQL user name
MYSQL_PASSWORD: the MySQL password
-v /usr/local/airflow/dags:/usr/local/airflow/dags
    host directory for DAG files : container directory for DAG files
-v /usr/local/airflow/airflowSql:/usr/local/airflow/airflowSql
    host directory for executable scripts : container directory for executable scripts
-v /usr/local/airflow/airflow.cfg:/usr/local/airflow/airflow.cfg
    maps the airflow configuration file to the host
airflow: the image name

7. Enter the container
docker exec -it -u root AirFlow bash
/* This drops you into the container's /usr/local/airflow directory (airflow's default install directory) */

8. Edit the configuration file
vim airflow.cfg

dags_folder = $AIRFLOW_HOME/dags          # directory where DAG files are stored
base_log_folder = $AIRFLOW_HOME/logs      # directory where run logs are stored
executor = LocalExecutor
sql_alchemy_conn = mysql://$MYSQL_USERNAME:$MYSQL_PASSWORD@$MYSQL_IP_PORT
load_examples = False
dags_are_paused_at_creation = False

9. Initialize the database
airflow initdb

If initialization fails with:
airflow.exceptions.AirflowException: Could not create Fernet object: Incorrect padding
Fix: generate a Fernet key, export it (use the key printed by the first command), and rerun the initialization:
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
export AIRFLOW__CORE__FERNET_KEY=oNu9XwewQNyx9mAJT2vZvtm3qzPRZIWRqwk9hSVch4A=
airflow initdb    // rerun the database initialization

10. Run the services in the background
Webserver:
nohup airflow webserver >> $AIRFLOW_HOME/airflow-webserver.log 2>&1 &
Scheduler:
nohup airflow scheduler >> $AIRFLOW_HOME/airflow-scheduler.log 2>&1 &

11. Open in a browser:
172.16.117.125:8084
/* new server IP address : the mapped host port (8084 in this example) */
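The commit/save pair in steps 2 and 3 can also be scripted on the source host. A minimal sketch using the Docker SDK for Python (an assumption: pip install docker; the steps above do not require it); the container name AirFlow and image name airflow mirror the example:

# Programmatic equivalent of `docker commit` + `docker save -o airflow.tar`,
# assuming the docker SDK for Python is installed on the host.
import docker

client = docker.from_env()
container = client.containers.get("AirFlow")
image = container.commit(repository="airflow", tag="latest")  # docker commit

with open("airflow.tar", "wb") as tar:    # docker save -o airflow.tar airflow
    for chunk in image.save(named=True):  # stream the image as a tar archive, keeping the tag
        tar.write(chunk)

The resulting airflow.tar is loaded on the target server with docker load -i airflow.tar, exactly as in step 5.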
III. How to Use the AirFlow Container
1. Put the DAG files in /usr/local/airflow/dags (or whichever directory was configured above).

2. Template for a job that runs on the airflow server itself

import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 17, 17, 12, 1),
    'retries': 5,
    'retry_delay': timedelta(seconds=5),
}

dag = DAG(
    'c_test',
    default_args=default_args,
    description='my second DAG',
    schedule_interval=timedelta(minutes=1)
)

filename1 = '/usr/local/airflow/test/a1.txt'
filename2 = '/usr/local/airflow/test/a2.txt'
filename3 = '/usr/local/airflow/test/a3.txt'

def print_hello1():
    print("Hello World!1111111")
    current_time = time.asctime(time.localtime(time.time()))
    with open(filename1, 'a') as f:
        f.write(current_time)

def print_hello2():
    print("Hello World!22222222")
    current_time = time.asctime(time.localtime(time.time()))
    with open(filename2, 'a') as f:
        f.write(current_time)

def print_hello3():
    print("Hello World!33333333")
    current_time = time.asctime(time.localtime(time.time()))
    with open(filename3, 'a') as f:
        f.write(current_time)

task1 = PythonOperator(
    task_id='task_1',
    python_callable=print_hello1,
    dag=dag)

task2 = PythonOperator(
    task_id='task_2',
    python_callable=print_hello2,
    dag=dag)

task3 = PythonOperator(
    task_id='task_3',
    python_callable=print_hello3,
    dag=dag)

task2.set_upstream(task1)
task3.set_upstream(task1)

3. Template for a job that runs on a remote server

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator

sshHook = SSHHook(remote_host='172.16.117.126', username='root', password='GXcxkfbrgx@26', timeout=30)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 12, 27, 10, 22, 0),
    'retries': 3,
    'retry_delay': timedelta(seconds=5),
    'end_date': datetime(9999, 12, 31)
}

dag = DAG('hello', default_args=default_args, schedule_interval='0 * * * *')

hello = SSHOperator(
    ssh_hook=sshHook,
    task_id='hello',
    dag=dag,
    command='/opt/sh/hello.sh '
)

/*
sshHook = SSHHook(remote_host='172.16.117.126', username='root', password='GXcxkfbrgx@26', timeout=30)
i.e. sshHook = SSHHook(remote_host='<remote server IP>', username='<user name>', password='<password>', timeout=30)
*/
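As a usage note, the dependencies in the first template can also be written with Airflow's bitshift operators instead of set_upstream. A minimal sketch with Airflow 1.10-style imports (matching the puckel/docker-airflow image); the DAG id and DummyOperator tasks here are illustrative only, not part of the templates above:

# Bitshift dependency syntax, equivalent in shape to
# task2.set_upstream(task1); task3.set_upstream(task1)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    'bitshift_example',
    default_args={'owner': 'airflow', 'start_date': datetime(2019, 12, 17)},
    schedule_interval=timedelta(minutes=1),
)

task1 = DummyOperator(task_id='task_1', dag=dag)
task2 = DummyOperator(task_id='task_2', dag=dag)
task3 = DummyOperator(task_id='task_3', dag=dag)

task1 >> task2  # task_1 runs first, then task_2
task1 >> task3  # task_3 also waits only on task_1, so it runs in parallel with task_2

For quick debugging, a single task instance can also be run in isolation from inside the container with the Airflow 1.10 CLI, e.g. airflow test c_test task_1 2019-12-17, without waiting for the scheduler.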