Apache Airflow is a Python-based tool for programmatically developing, scheduling, and monitoring workflows.
This repository is intended to allow the user to duplicate the environment used for this walkthrough. It requires Vagrant and a virtualisation provider such as VirtualBox; all commands below are for a bash-like shell.
The Vagrantfile is set up to create a VM with 8GB of RAM; edit the Vagrantfile if your system doesn't have that much memory!
It is also possible to install Airflow through pip.
$ git clone
# install the docker-compose plugin
$ vagrant plugin install vagrant-docker-compose
$ vagrant up
# to enter VM
$ vagrant ssh
Airflow can also be installed as a local instance using pip, the Python package manager. It is a little more involved than just pip install airflow, so the official steps are included below for reference.
# Airflow needs a home. `~/airflow` is the default, but you can put it
# somewhere else if you prefer (optional)
export AIRFLOW_HOME=~/airflow
# Install Airflow using the constraints file
AIRFLOW_VERSION=2.2.3
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
# For example: 3.6
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example: https://raw.githubusercontent.com/apache/airflow/constraints-2.2.3/constraints-3.6.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
# The Standalone command will initialise the database, make a user,
# and start all components for you.
airflow standalone
# Visit localhost:8080 in the browser and use the admin account details
# shown in the terminal to log in.
# Enable the example_bash_operator DAG on the home page
Now, the pipeline defined in dags/pipeline.py
isn't really a good example of a pipeline: it will work on its first run, but subsequent runs will fail. It does, however, give you a rough look at how Airflow works. Maybe you can get it behaving as it should?
To spin it all up:
$ cd /home/vagrant/airflow
# this will run and exit with 0 if successful
$ docker-compose up airflow-init
$ docker-compose up
This will spin up all services. You should then navigate to http://localhost:8080 and log in with airflow
for both the username and password. You can then trigger a run by unpausing the etl_pipeline
in the DAGs list, or by clicking the play button on the right-hand side of the etl_pipeline
row.
If you'd rather use VSCode for exploring and editing this repository you can configure the Remote SSH plugin for VSCode to use your running Vagrant box with the following steps.
- Install VSCode Remote SSH
- Get vagrant SSH config details
$ vagrant ssh-config
- Add these details to your default .ssh/config file or another config file you wish to use
- Start up your vagrant VM
$ vagrant up
- Start a remote session via VSCode and select the name of the vagrant host
Adapted from Airflow Docs
Airflow is a tool that lets you build and run workflows. A workflow is represented as a directed acyclic graph (DAG) containing individual tasks, arranged with their dependencies and data flows in mind.
The DAG specifies the dependencies between tasks, the order in which they run, and whether to attempt retries. Each task describes its own specific job, be it fetching data, running an analysis, or detecting a change.
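The ordering the scheduler derives from a DAG can be sketched with a plain topological sort. This is just an illustration of the idea using Python's standard library, not Airflow's actual scheduler code, and the task names are made up:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the set of
# tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "analyse": {"transform"},
    "load": {"transform"},
}

# A topological sort yields an order in which every task runs only
# after all of its upstream tasks have completed.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Note that "analyse" and "load" both depend only on "transform", so they have no ordering between them; in Airflow such independent tasks can run in parallel if the executor supports it.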
General components of an Airflow installation:
- A scheduler, handles triggering scheduled workflows, and submitting tasks to executors
- An executor, handles running tasks, can be run inside the scheduler, or in production pushes tasks out to workers to run
- A webserver, presents the UI
- A folder of DAG files, to be read by the scheduler and executor
- A metadata database, used by the scheduler, executor and webserver to store state
A task may be one of a number of common types: