Airflow notes
Official site: http://airflow.incubator.apache.org/project.html
Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG.
The first argument task_id acts as a unique identifier for the task.
The precedence rules for a task are as follows:
Explicitly passed arguments
Values that exist in the default_args dictionary
The operator’s default value, if one exists
A task must include or inherit the arguments task_id and owner
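A minimal sketch of these rules, assuming Airflow 1.x import paths (the dag_id and task_id below are illustrative):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        'owner': 'airflow',  # satisfies the required 'owner' argument for every task
        'retries': 1,
    }

    # 'tutorial_demo' is the dag_id, the unique identifier for the DAG
    dag = DAG('tutorial_demo', default_args=default_args,
              start_date=datetime(2015, 6, 1), schedule_interval='@daily')

    # task_id uniquely identifies the task. retries=3 is passed explicitly, so it
    # overrides default_args['retries']; owner is not passed, so it falls back to
    # default_args; anything set in neither place uses the operator's own default.
    t = BashOperator(task_id='print_date', bash_command='date',
                     retries=3, dag=dag)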
Let’s assume we’re saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg.
The default location for your DAGs is ~/airflow/dags.
Note that if you use depends_on_past=True, individual task instances will depend on the success of their preceding task instance, except for the first instance (the one at the start_date itself), for which this dependency is disregarded.
You can also set options with environment variables by using this format: $AIRFLOW__{SECTION}__{KEY}
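For example, the environment variable AIRFLOW__CORE__SQL_ALCHEMY_CONN overrides the sql_alchemy_conn value in the [core] section of airflow.cfg.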
================================
# print the list of active DAGs
airflow list_dags
# prints the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial
airflow backfill : runs a DAG over a date range, e.g. airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
airflow test : tests a single task instance, without checking dependencies or recording state to the database, e.g. airflow test tutorial print_date 2015-06-01
airflow webserver :will start a web server
===============
1 “LocalExecutor” :an executor that can parallelize task instances locally.
2 Config file path: $AIRFLOW_HOME/airflow.cfg; the sql_alchemy_conn setting in the config file points to the address of the metadata database
3 Default value of AIRFLOW_HOME: ~/airflow
4 Admin->Connection : The pipeline code you will author will reference the ‘conn_id’ of the Connection objects
5 Values set in environment variables take precedence over the corresponding values in the config file
6 Environment variables for connections must have the prefix AIRFLOW_CONN_ and must be all uppercase; if the conn_id is named postgres_master, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER. The value of such an environment variable should be in URI format, e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3
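A minimal sketch of defining a connection through such an environment variable and then referencing it by conn_id from pipeline code (the query is illustrative; Airflow 1.x import path assumed):

    # Shell side, URI format with the AIRFLOW_CONN_ prefix:
    #   export AIRFLOW_CONN_POSTGRES_MASTER=postgres://user:password@localhost:5432/master

    from airflow.hooks.postgres_hook import PostgresHook

    # The pipeline code references the Connection object only by its conn_id:
    hook = PostgresHook(postgres_conn_id='postgres_master')
    rows = hook.get_records('SELECT 1')  # illustrative query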
7 Users can specify a logs folder in airflow.cfg. By default, it is in the AIRFLOW_HOME directory. Logs are stored in the log folder as {dag_id}/{task_id}/{execution_date}/{try_number}.log
8 operator : The airflow/contrib/ directory contains yet more operators built by the community
9 a) SubDAG operators should contain a factory method that returns a DAG object.
b) SubDAGs must have a schedule and be enabled.
c) Refrain from using depends_on_past=True in tasks within the SubDAG, as this can be confusing.
d) It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic
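A minimal sketch of a SubDAG factory following a)-d), assuming Airflow 1.x import paths (names are illustrative):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.subdag_operator import SubDagOperator

    def subdag_factory(parent_dag_id, child_task_id, default_args):
        # a) the factory returns a DAG object; b) it has a schedule
        # (a SubDAG's dag_id is '<parent_dag_id>.<child_task_id>')
        dag = DAG(dag_id='%s.%s' % (parent_dag_id, child_task_id),
                  default_args=default_args, schedule_interval='@daily')
        # c) no depends_on_past=True on tasks inside the SubDAG
        DummyOperator(task_id='step', dag=dag)
        return dag

    default_args = {'owner': 'airflow', 'start_date': datetime(2016, 1, 1)}
    main_dag = DAG('main', default_args=default_args, schedule_interval='@daily')

    sub = SubDagOperator(
        task_id='my_subdag',
        subdag=subdag_factory('main', 'my_subdag', default_args),
        dag=main_dag,
    )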
10
11 If you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59
12 The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely.
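The executor is chosen with the executor key in the [core] section of airflow.cfg, e.g. executor = LocalExecutor.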
13 Airflow can assign an abstract Pool to any task, and each Pool can be given a number of slots. Whenever a task starts, it occupies one slot; when all slots are occupied, the remaining tasks wait.
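A minimal sketch, assuming a pool named 'etl_pool' has already been created with some slot count under Admin -> Pools:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('pool_demo', start_date=datetime(2016, 1, 1), schedule_interval='@daily')

    # Each running instance of this task occupies one slot of 'etl_pool';
    # when all slots are taken, further instances wait until a slot frees up.
    t = BashOperator(
        task_id='load_partition',
        bash_command='echo load',
        pool='etl_pool',
        dag=dag,
    )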
14 Processing a DAG in the previous scheduling round can take very long, so it may still be unfinished when the next round comes around. Airflow's logic is to not create a process for that DAG in the new round, so processes are not blocked from handling the remaining DAGs.
15 A task must include or inherit the arguments task_id and owner, otherwise Airflow will raise an exception. See the constructor of BaseOperator for the full list of parameters a task supports.
16 retry_exponential_backoff makes the interval between retries grow progressively longer
wait_for_downstream keeps this run from executing if the previous DAG run has not finished yet
weight_rule determines how each task's priority is computed
execution_timeout controls the task's timeout
trigger_rule controls the condition under which the task is triggered
task_concurrency controls how many instances of the same task can run in parallel
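A minimal sketch putting these parameters on one task (values are illustrative; assumes an Airflow version where all six BaseOperator parameters exist):

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('params_demo', start_date=datetime(2016, 1, 1), schedule_interval='@daily')

    t = BashOperator(
        task_id='guarded_task',
        bash_command='echo run',
        retries=5,
        retry_exponential_backoff=True,           # retry intervals grow progressively longer
        wait_for_downstream=True,                 # wait on the previous run before starting
        weight_rule='downstream',                 # how priority_weight is aggregated
        execution_timeout=timedelta(minutes=30),  # task-level timeout
        trigger_rule='all_success',               # upstream condition for firing
        task_concurrency=1,                       # max parallel instances of this task
        dag=dag,
    )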
17 airflow test does not check the task's dependencies; it executes the task directly
18 Root-causing occasional high memory usage in Airflow: historical tasks hung and never finished, remaining in the running state. Over time, the number of tasks in the running and queued states exceeded concurrency, so every task instance generated afterwards stayed in the scheduled state. On each scheduling pass the scheduler fetches all scheduled-state tasks and sorts them (among other operations); because there were so many scheduled-state tasks, this consumed a lot of memory.
19 Template variables:
{
    'dag': task.dag,
    'ds': ds,
    'ds_nodash': ds_nodash,
    'ts': ts,
    'ts_nodash': ts_nodash,
    'yesterday_ds': yesterday_ds,
    'yesterday_ds_nodash': yesterday_ds_nodash,
    'tomorrow_ds': tomorrow_ds,
    'tomorrow_ds_nodash': tomorrow_ds_nodash,
    'END_DATE': ds,
    'end_date': ds,
    'dag_run': dag_run,
    'run_id': run_id,
    'execution_date': self.execution_date,
    'prev_execution_date': prev_execution_date,
    'next_execution_date': next_execution_date,
    'latest_date': ds,
    'macros': macros,
    'params': params,
    'tables': tables,
    'task': task,
    'task_instance': self,
    'ti': self,
    'task_instance_key_str': ti_key_str,
    'conf': configuration,
    'test_mode': self.test_mode,
    'var': {
        'value': VariableAccessor(),
        'json': VariableJsonAccessor(),
    },
}
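A minimal sketch rendering a few of these variables through Jinja templating (the params key is hypothetical):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('template_demo', start_date=datetime(2016, 1, 1), schedule_interval='@daily')

    t = BashOperator(
        task_id='templated',
        # ds / ds_nodash come from the template context above, rendered at runtime
        bash_command='echo "date={{ ds }} nodash={{ ds_nodash }} table={{ params.table }}"',
        params={'table': 'events'},  # hypothetical parameter
        dag=dag,
    )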
20