This is the demo project for the talk Reproducible and Deployable Data Science with Open-Source Python at EuroPython 2021. In a nutshell, it takes a realistic Jupyter notebook and related utilities, turns it into a Python project with Kedro, deploys it to an Airflow cluster, and integrates Great Expectations for automated data quality checks.
- Kedro Documentation
- Great Expectations Documentation
- Airflow Documentation
- Astronomer's guide to deploying Kedro pipelines with Airflow
- Talk on Reproducible and maintainable data science code with Kedro by Yetunde Dada at PyCon US 2021.
- Microsoft Recommenders
Create a new virtual environment, for example with conda:
```bash
conda create -n europython-2021-py37 python=3.7
```

Install the project's dependencies:

```bash
pip install -r src/dev_requirements.txt
```

Run the pipeline:

```bash
kedro run
```

Visualise the pipeline:

```bash
kedro viz
# or `kedro viz --autoreload` if you want the visualisation to autoreload on file changes.
```

During a Kedro run, a data validation hook using Great Expectations is called automatically for the `cleaned_movies` dataset. To view the data docs showing the validation results, open `conf/base/great_expectations/uncommitted/data_docs/local_site/index.html`.
The design is simple: before a dataset is saved, if an expectation suite matching the dataset name exists, the dataset is validated, thanks to the `before_dataset_saved` hook defined in `hooks.py`.
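The idea behind the hook can be illustrated with a standalone sketch. In the real project the logic lives in `hooks.py` as a Kedro `before_dataset_saved` hook and delegates to Great Expectations; here the suite lookup and the "validate only if a matching suite exists" behaviour are mocked with plain Python (dataset names, columns, and checks are hypothetical):

```python
# Standalone sketch of the validation-hook design: before a dataset is
# saved, look up an expectation suite by dataset name and validate the
# data against it. The checks below stand in for Great Expectations.

EXPECTATION_SUITES = {
    # dataset name -> list of (description, check) pairs (hypothetical)
    "cleaned_movies": [
        ("titles are non-empty", lambda rows: all(r["title"] for r in rows)),
        ("ratings within 0-10", lambda rows: all(0 <= r["rating"] <= 10 for r in rows)),
    ],
}


def before_dataset_saved(dataset_name, data):
    """Validate `data` if an expectation suite matching the dataset name exists."""
    suite = EXPECTATION_SUITES.get(dataset_name)
    if suite is None:
        return True  # no matching suite -> nothing to validate, save proceeds
    failures = [desc for desc, check in suite if not check(data)]
    if failures:
        raise ValueError(f"Validation failed for {dataset_name}: {failures}")
    return True


rows = [{"title": "Heat", "rating": 8.3}, {"title": "Alien", "rating": 8.5}]
assert before_dataset_saved("cleaned_movies", rows)   # suite found, data passes
assert before_dataset_saved("unknown_dataset", [])    # no suite, save proceeds
```

The key design choice is that validation is keyed purely by dataset name, so adding quality checks for a new dataset never requires touching pipeline code, only adding a suite.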
Exercise: add more expectation suites for other datasets in the project.
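As a starting point for the exercise: an expectation suite is a JSON file under `conf/base/great_expectations/expectations/` named after the dataset it validates. A minimal sketch of what such a file could look like (the column names and bounds here are hypothetical; adapt them to the dataset you pick):

```json
{
  "expectation_suite_name": "cleaned_movies",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "title"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "rating", "min_value": 0, "max_value": 10}
    }
  ],
  "meta": {}
}
```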
The Airflow DAGs are located under `dags/`. The one called `europython_2021_demo_dag.py` is automatically generated using kedro-airflow and corresponds to the following deployment:
The other, called `grouped_nodes_dag.py`, is manually adapted from the original DAG to demonstrate that you can deploy a Kedro pipeline as an Airflow DAG with much lower granularity:
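One way to think about the lower-granularity idea is to partition the node dependency graph into sequential layers, so each layer can become a single Airflow task instead of one task per node. The sketch below is a toy illustration with hypothetical node names; the actual `grouped_nodes_dag.py` was grouped by hand and may use a different strategy:

```python
# Sketch: group a toy Kedro-style node graph into dependency layers,
# reducing five single-node Airflow tasks to four coarser tasks.

# node -> set of upstream nodes (hypothetical pipeline)
DEPENDENCIES = {
    "clean_movies": set(),
    "clean_ratings": set(),
    "join_tables": {"clean_movies", "clean_ratings"},
    "train_model": {"join_tables"},
    "evaluate_model": {"train_model"},
}


def group_nodes(dependencies):
    """Group nodes into sequential layers; each layer can be one Airflow task."""
    remaining = dict(dependencies)
    done, groups = set(), []
    while remaining:
        # nodes whose upstream dependencies have all completed
        ready = sorted(n for n, up in remaining.items() if up <= done)
        if not ready:
            raise ValueError("cycle detected in dependency graph")
        groups.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return groups


print(group_nodes(DEPENDENCIES))
# -> [['clean_movies', 'clean_ratings'], ['join_tables'], ['train_model'], ['evaluate_model']]
```

Fewer, coarser tasks reduce Airflow scheduling overhead at the cost of per-node observability and retries, which is the trade-off the two DAGs in `dags/` let you compare.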
Install Astronomer CLI then run the following command:
```bash
kedro package
cp src/dist/*.whl ./
astro dev start
```

Open http://localhost:8080 to view the Airflow UI.
Read the guide to deploying a Kedro pipeline to Airflow with Astronomer for more details.
Run the test suite:

```bash
pytest
```

Exercise: currently there are only a few tests, copied verbatim from the original notebook's utilities. Write more tests, especially integration tests, for the pipelines and set up a CI pipeline to run the test suite on every commit.


