Skip to content

Merge pull request #21 from axiom-of-choice/feature/refactor-hourly-pipeline

Feature/refactor hourly pipeline
axiom-of-choice authored Sep 18, 2023
2 parents bfa6312 + 226780d commit e1a5925
Showing 5 changed files with 9 additions and 24 deletions.
README.md: 32 changes (9 additions, 23 deletions)
@@ -27,40 +27,26 @@ I used Pandas as the data manipulation layer because it offers a complete solution t…
### File structure
The directory layout follows a standard ETL pipeline structure:
* *airflow/* Includes:
- dags/: directory containing the weather_etl module, whose extract, load, transform and utils submodules are used in the pipeline_weather DAG to keep the DAG file clean and organized
- dags/: directory containing the ETL modules and custom operators used in the pipeline DAGs to keep the DAG files clean and organized
- *data/* Includes:
- *data_municipios/*: Where data about municipios is stored (it should live elsewhere, e.g. in a database or storage service)
- *intermediate/*: An intermediate pre-processed storage layer that stores the data requested from the web service, uncompressed, in JSON format (in production this should be S3 or Google Cloud Storage)
- *processed/*
- */process_1* Stores the data required for the second point of the exercise (it should be stored in an external database or storage service)
- */process_1* Stores the data required for the third and fourth points of the exercise (it should be stored in an external database or storage service)
- *raw/*: Stores the raw data requested from the web service in compressed gzip format (in production this should be S3 or Google Cloud Storage)
- */logs*: Stores the logs of the DAG execution generated by Airflow
- */plugins*: Airflow plugins directory
- *data_municipios/*: Where static data about municipios is stored (it should live elsewhere, e.g. in a database or storage service)
- Airflow config files: *airflow.cfg, airflow.db, webserver_config.py*
* */scripts* Bash scripts
* */Tests* Unit testing
* *example.env* file
- queries.toml (file containing the queries to be run in BigQuery; see the loading sketch after this list)
* *example.env* Example environment file
* *requirements.txt* File
* *docker-compose.yml* File
* *Dockerfile* for building the custom Docker image
* */example_data* Contains some sample executions
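
For illustration, here is a minimal sketch of how queries.toml could be consumed; the file path and the table/key names (`aggregations`, `hourly_summary`) are assumptions for the example, not taken from this repository:

```python
# Hypothetical sketch of reading queries.toml; the path and key names are
# assumptions, not the repository's actual layout.
import tomllib  # standard library in Python 3.11+; older versions can use the `toml` package

with open("airflow/dags/queries.toml", "rb") as f:  # assumed location
    queries = tomllib.load(f)

# Each entry would hold a SQL string to be submitted to BigQuery by the DAG.
hourly_summary_sql = queries["aggregations"]["hourly_summary"]
print(hourly_summary_sql)
```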
### Logic
I followed the steps below to complete the solution:
1. Extract phase: Request the endpoint with the extract submodule, which downloads the raw compressed file, stores it in the raw layer, uncompresses it, converts it to JSON format and stores it in the intermediate layer.
2. Transform phase: Generate table 1 using the transform submodule and **push it through the Airflow XCom backend (it should be persisted somewhere first, but for the purposes of this exercise this works)**, then generate the second table, which reads table 1 from the XCom backend and depends directly on it.
3. Load phase: Write the generated tables to local storage, labelling the datasets properly (they should be stored somewhere outside the container or local storage).
4. Cleaning phase: Clean the staging folders containing raw and intermediate data, as well as the XCom storage.
1. Extract phase: Request the endpoint with the extract submodule, which downloads the raw compressed file and stores it in an S3 bucket.
2. Load phase: Generate table 1 using custom operators and push the raw table into BigQuery after schema and data type validation.
3. Transform phase: Write aggregated tables into BigQuery from the existing raw table.
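
Below is a minimal sketch of how these phases could be wired as an Airflow DAG. The DAG id, schedule and task callables are illustrative placeholders (they are not the repository's actual custom operators); only the extract → load → transform ordering comes from the steps above.

```python
# Illustrative sketch only (assumes Airflow 2.4+): task names and callables are
# placeholders, not the project's real custom operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3(**context):
    """Request the weather endpoint and store the compressed response in S3 (placeholder)."""


def load_raw_to_bigquery(**context):
    """Validate schema/data types and load the raw table into BigQuery (placeholder)."""


def transform_in_bigquery(**context):
    """Run the aggregation queries (e.g. from queries.toml) against the raw table (placeholder)."""


with DAG(
    dag_id="hourly_weather_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load", python_callable=load_raw_to_bigquery)
    transform = PythonOperator(task_id="transform", python_callable=transform_in_bigquery)

    extract >> load >> transform
```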

### Strengths of the solution
1. It uses state-of-the-art tools like Docker and Airflow to execute and orchestrate the pipeline virtually anywhere
2. Modularized, atomic and well-documented code

### Weaknesses
1. It doesn't use external storage, and it should.
2. Some functions are not optimal, with some file paths hardcoded for practical purposes.
3. I didn't include tests, but a production environment should have them (see the sketch below).
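
As a sketch of what such a test could look like, here is a hypothetical example; the `validate_schema` helper is defined inline purely for illustration and is not a function from this repository:

```python
# Hypothetical pytest sketch: `validate_schema` is defined here for illustration
# only and does not correspond to code in this repository.
import pandas as pd
import pytest


def validate_schema(df: pd.DataFrame, required_columns: list[str]) -> None:
    """Raise ValueError if any required column is missing from the raw table."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")


def test_validate_schema_rejects_missing_column():
    df = pd.DataFrame({"station_id": ["MX01"], "temperature": [21.5]})
    with pytest.raises(ValueError):
        validate_schema(df, required_columns=["station_id", "temperature", "observed_at"])
```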

### Improvements for future versions, and how to scale, organize and automate the solution:
1. Use cloud-managed Airflow
2. Include a CI/CD/QA/DQ flow
3. Use external cloud storage wherever possible
Empty file removed: airflow/data/raw/.gitkeep
response.json: 1 change (0 additions, 1 deletion)

This file was deleted.
