Apache Airflow is a job scheduler that was originally developed inside Airbnb to streamline data ETL operations. Since it was open-sourced in June 2015, Airflow has grown in popularity among data teams thanks to its functionality, scalability, and, of course, its Pythonic implementation. Creating a scheduled task is about as easy as writing a Python script and saving it to Airflow's DAGs folder, which makes it a very nice tool for data scientists who do most of their work in Python. Even jobs with many constituent parts, each with their own dependencies, can be easily created or modified, and end up resembling something like the example below.
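As a rough sketch of what such a DAG definition can look like (the task names and commands here are placeholders, not the project's actual example DAGs), a small three-step job could be written as:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# One DAG, three tasks: extract runs first, then transform, then load
dag = DAG('example_etl', default_args=default_args, schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo "extracting"', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo "transforming"', dag=dag)
load = BashOperator(task_id='load', bash_command='echo "loading"', dag=dag)

# Declare the dependency chain
extract >> transform >> load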

However, while Airflow can be run as an entirely standalone application, doing so sacrifices much of the functionality and scalability that make it attractive in the first place. To support those two aspects, Airflow requires a database and a message broker. Additionally, Airflow can be broken up into four distinct services: a webserver, a task scheduler, a worker, and an optional UI for worker monitoring. Because each of these services serves a unique purpose, and some can scale up their compute requirements to the detriment of the others, it makes sense to containerize them. To that end, Docker lets us split each service into its own container, and those containers can be allocated to a single machine or spread across an entire cluster.

In this implementation, each of the above services has been allocated its own container, and, should you decide to deploy to a cluster, all of the worker containers are set to deploy to worker nodes. On top of this, I have also added a DAG builder function that simplifies creating and modifying DAGs: the author just edits a Python dictionary that specifies the DAG's dependency structure, as sketched below. I've also provisioned each Airflow-based container with an Anaconda Python environment, called airflow_env, which is built from the conda and pip requirements text files when the images are built. Finally, because every service is containerized, users can build a local copy of Airflow that uses the Celery executor by default, allowing the local copy to run multiple DAGs in parallel.
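The helper name and dictionary layout below are my own illustration rather than the exact code in the repo, but the idea is roughly this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical layout: each key is a task id, each value holds the task's
# bash command and the upstream tasks it depends on
DAG_SPEC = {
    'extract':   {'command': 'echo "extracting"',   'depends_on': []},
    'transform': {'command': 'echo "transforming"', 'depends_on': ['extract']},
    'load':      {'command': 'echo "loading"',      'depends_on': ['transform']},
}

def build_dag(dag_id, spec, start_date, schedule_interval='@daily'):
    """Build a DAG from a dictionary describing tasks and their dependencies."""
    dag = DAG(dag_id, start_date=start_date, schedule_interval=schedule_interval)

    # Create one operator per entry in the spec
    tasks = {
        task_id: BashOperator(task_id=task_id, bash_command=conf['command'], dag=dag)
        for task_id, conf in spec.items()
    }

    # Wire up the dependencies declared in the spec
    for task_id, conf in spec.items():
        for upstream_id in conf['depends_on']:
            tasks[upstream_id] >> tasks[task_id]

    return dag

# Airflow picks up module-level DAG objects from files in the dags directory
dag = build_dag('example_dict_dag', DAG_SPEC, datetime(2018, 1, 1))

Adding a task or re-wiring dependencies then just means editing the dictionary, rather than restating operators and dependency arrows by hand.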

Now for the fun part!

To build a local copy of this project:

1. Make sure you have Docker installed on your machine and that the daemon is running
2. Download this project from my GitHub
3. Navigate to the newly downloaded repo in your terminal
4. Run:
• docker-compose build
• docker-compose up -d --scale worker=2

Now you should be able to view the Airflow UI at localhost:8080 and the worker monitoring service, Flower, at localhost:5555. You can execute any of the example DAGs packaged in this project, or add your own by dropping a new DAG definition file into the dags directory. And, because the Postgres database is backed up as a Docker volume, your historical data, logs, and connection data will persist for as long as you decide to retain the volume.

To terminate your local instance of Airflow, just run: docker-compose down


To distribute this project across a cluster:

1. Make sure you have Docker installed on each node and that the daemon is running
2. Ensure that all nodes have TCP port 2377, TCP and UDP port 7946, and UDP port 4789 exposed
3. Download this project from my GitHub on your main master node
4. Navigate to the newly downloaded repo in your terminal
5. Run: docker swarm init
6. Add new master or worker nodes to the swarm using the join token displayed in step 5
7. Ensure that you have at least one worker node in the swarm, otherwise DAGs will not execute
8. Deploy your stack to the swarm: docker stack deploy --compose-file docker-compose.yaml dairflow

You should now have an Airflow cluster that assigns tasks to its workers as they are added to the queue. However, there is one thing to note: autoscaling is purposefully left up to your own implementation, meaning that once you deploy Airflow to your cluster, your master and worker nodes will stay scaled to your initial configuration. But, since you have a job scheduler now, that should be a bit easier!

If you have any questions or concerns, please feel free to reach out or raise an issue on GitHub. And as always, the GitHub page has some additional info about features like email and connection encryption.