This repository was archived by the owner on Oct 22, 2023. It is now read-only.

limdauto/europython-2021-demo

EuroPython 2021 Demo Project

Overview

This is the demo project for the talk Reproducible and Deployable Data Science with Open-Source Python at EuroPython 2021. In a nutshell, it takes a realistic Jupyter Notebook and its related utilities, turns them into a Python project with Kedro, deploys that project to an Airflow cluster, and integrates Great Expectations for automated data quality checks.

Useful Links

  1. Kedro Documentation
  2. Great Expectations Documentation
  3. Airflow Documentation
  4. Astronomer's guide to deploying a Kedro pipeline with Airflow
  5. Talk on Reproducible and maintainable data science code with Kedro by Yetunde Dada at PyCon US 2021
  6. Microsoft Recommenders

Installation

Create a new virtual environment, for example with conda, and activate it:

conda create -n europython-2021-py37 python=3.7
conda activate europython-2021-py37

Install the project's dependencies:

pip install -r src/dev_requirements.txt

Run the pipeline

kedro run

View the pipeline visualisation

kedro viz

# or `kedro viz --autoreload` if you want the visualisation to reload automatically on file changes.

Great Expectations integration

During a Kedro run, a data validation hook using Great Expectations is called automatically for the cleaned_movies dataset. To view the Data Docs, which show the validation results, open conf/base/great_expectations/uncommitted/data_docs/local_site/index.html.

The design is simple: before a dataset is saved, the before_dataset_saved hook defined in hooks.py checks whether an expectation suite matching the dataset name exists. If one does, the dataset is validated against it.
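The dispatch logic described above can be sketched without any dependencies. This is only an illustration, not the code in hooks.py: the `DataValidationHooks` class, the dictionary of checks standing in for expectation suites, and the sample rows are all hypothetical. In the real project the method would be registered as a Kedro hook and would delegate validation to Great Expectations.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for an expectation suite: named checks over a dataset.
Suite = Dict[str, Callable[[Any], bool]]


class DataValidationHooks:
    """Validate a dataset before it is saved, if a suite with its name exists."""

    def __init__(self, suites: Dict[str, Suite]) -> None:
        self.suites = suites
        self.results: Dict[str, Dict[str, bool]] = {}

    def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # Look up a suite by dataset name; datasets without one are skipped.
        suite = self.suites.get(dataset_name)
        if suite is None:
            return
        # Run every check and record its outcome under the dataset name.
        self.results[dataset_name] = {
            name: bool(check(data)) for name, check in suite.items()
        }


# Hypothetical suite for the cleaned_movies dataset.
suites = {
    "cleaned_movies": {
        "not_empty": lambda rows: len(rows) > 0,
        "titles_present": lambda rows: all(row.get("title") for row in rows),
    }
}

hooks = DataValidationHooks(suites)
hooks.before_dataset_saved("cleaned_movies", [{"title": "Heat"}, {"title": ""}])
hooks.before_dataset_saved("dataset_without_suite", [])  # silently skipped

print(hooks.results)
```

Because the lookup is keyed on the dataset name, adding validation for another dataset only requires adding a suite with that name; the hook itself does not change.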

Exercise: add more expectation suites for other datasets in the project.

Deployment with Airflow

The Airflow DAGs are located under dags/. The one called europuython_2021_demo_dag.py is generated automatically by kedro-airflow and corresponds to the following deployment:

The other, grouped_nodes_dag.py, is manually adapted from the original DAG to demonstrate that you can deploy a Kedro pipeline as an Airflow DAG with much lower granularity, where several Kedro nodes are grouped into a single task:

Install the Astronomer CLI, then run the following commands:
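One way to think about the grouping is as a collapse of the node-level dependency graph into a group-level one. The sketch below is an assumption about the general idea, not the contents of grouped_nodes_dag.py: the node names, the group names, and the `group_dependencies` helper are hypothetical.

```python
from typing import Dict, Set


def group_dependencies(
    node_deps: Dict[str, Set[str]], node_to_group: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Collapse node-level dependencies into dependencies between groups."""
    group_deps: Dict[str, Set[str]] = {g: set() for g in node_to_group.values()}
    for node, parents in node_deps.items():
        for parent in parents:
            child_group = node_to_group[node]
            parent_group = node_to_group[parent]
            # Edges inside a group disappear; only cross-group edges remain.
            if child_group != parent_group:
                group_deps[child_group].add(parent_group)
    return group_deps


# Hypothetical pipeline: each key is a node, each value is its set of parents.
node_deps = {
    "load_movies": set(),
    "clean_movies": {"load_movies"},
    "split_data": {"clean_movies"},
    "train_model": {"split_data"},
    "evaluate_model": {"train_model", "split_data"},
}

# Hypothetical grouping: five nodes become two coarser tasks.
node_to_group = {
    "load_movies": "data_engineering",
    "clean_movies": "data_engineering",
    "split_data": "data_science",
    "train_model": "data_science",
    "evaluate_model": "data_science",
}

grouped = group_dependencies(node_deps, node_to_group)
print(grouped)
```

Each group would then map to one Airflow task that runs its nodes together, so the scheduler manages two tasks instead of five.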

kedro package
cp src/dist/*.whl ./
astro dev start

Open http://localhost:8080 to view the Airflow UI.

For more details, read the guide on deploying a Kedro pipeline to Airflow with Astronomer here.

Running the tests

pytest

Exercise: currently there are only a few tests, copied verbatim from the original notebook's utilities. Write more tests, especially integration tests, for the pipelines, and set up a CI pipeline to run the test suite on every commit.
