Skip to content

Merge pull request #21 from axiom-of-choice/feature/refactor-hourly-pipeline

Feature/refactor hourly pipeline
axiom-of-choice authored Sep 18, 2023
2 parents bfa6312 + 226780d commit e1a5925
Showing 5 changed files with 9 additions and 24 deletions.
README.md: 32 changes (9 additions, 23 deletions)
@@ -27,40 +27,26 @@ I used Pandas as the data manipulation layer because it offers a complete solution t…
### File structure
The directory layout follows a standard ETL pipeline structure:
* *airflow/* Includes:
- dags/: directory containing the weather_etl module, whose extract, load, transform and utils submodules are used in the pipeline_weather DAG to keep the DAG file clean and organized
- dags/: directory containing the ETL modules and custom operators used in the pipeline DAGs to keep the DAG files clean and organized
- *data/* Includes:
- *data_municipios/*: Where data about municipios is stored (it should live elsewhere, e.g. in a database or storage service)
- *intermediate/*: An intermediate pre-processed storage layer that stores the data requested from the web service, uncompressed, in JSON format (in production this should be S3 or Google Cloud Storage)
- *processed/*
- */process_1* Stores the data required for the second point of the exercise (it should be stored in an external database or storage service)
- */process_1* Stores the data required for the third and fourth points of the exercise (it should be stored in an external database or storage service)
- *raw/*: Stores the raw data requested from the web service in compressed gzip format (in production this should be S3 or Google Cloud Storage)
- */logs*: Stores the logs of the DAG execution generated by Airflow
- */plugins*: Airflow plugins directory
- *data_municipios/*: Where static data about municipios is stored (it should live elsewhere, e.g. in a database or storage service)
- Airflow config files: *airflow.cfg, airflow.db, webserver_config.py*
* */scripts* Bash scripts
* */Tests* Unit testing
* *example.env* file
- queries.toml (file containing the queries to be run in BigQuery; see the loading sketch after this list)
* *example.env* Example environment file
* *requirements.txt* File
* *docker-compose.yml* File
* *Dockerfile* for building the custom Docker image
* */example_data* Contains some sample executions
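
For illustration, here is a minimal sketch of how queries.toml could be consumed; the file path and the table/key names (`aggregations`, `hourly_summary`) are assumptions for the example, not taken from this repository:

```python
# Hypothetical sketch of reading queries.toml; the path and key names are
# assumptions, not the repository's actual layout.
import tomllib  # standard library in Python 3.11+; older versions can use the `toml` package

with open("airflow/dags/queries.toml", "rb") as f:  # assumed location
    queries = tomllib.load(f)

# Each entry would hold a SQL string to be submitted to BigQuery by the DAG.
hourly_summary_sql = queries["aggregations"]["hourly_summary"]
print(hourly_summary_sql)
```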
### Logic
I followed the steps below to complete the solution:
1. Extract phase: Request the endpoint with the extract submodule, which downloads the raw compressed file, stores it in the raw layer, uncompresses it, converts it to JSON format and stores it in the intermediate layer.
2. Transform phase: Generate table 1 using the transform submodule and **push it through the Airflow XCom backend (it should be persisted somewhere first, but for the purposes of this exercise this works)**, then generate the second table, which reads table 1 from the XCom backend and depends directly on it.
3. Load phase: Write the generated tables to local storage, labelling the datasets properly (they should be stored somewhere outside the container or local storage).
4. Cleaning phase: Clean the staging folders containing raw and intermediate data, as well as the XCom storage.
1. Extract phase: Request the endpoint with the extract submodule, which downloads the raw compressed file and stores it in an S3 bucket.
2. Load phase: Generate table 1 using custom operators and push the raw table into BigQuery after schema and data type validation.
3. Transform phase: Write aggregated tables into BigQuery from the existing raw table.
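
Below is a minimal sketch of how these phases could be wired as an Airflow DAG. The DAG id, schedule and task callables are illustrative placeholders (they are not the repository's actual custom operators); only the extract → load → transform ordering comes from the steps above.

```python
# Illustrative sketch only (assumes Airflow 2.4+): task names and callables are
# placeholders, not the project's real custom operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3(**context):
    """Request the weather endpoint and store the compressed response in S3 (placeholder)."""


def load_raw_to_bigquery(**context):
    """Validate schema/data types and load the raw table into BigQuery (placeholder)."""


def transform_in_bigquery(**context):
    """Run the aggregation queries (e.g. from queries.toml) against the raw table (placeholder)."""


with DAG(
    dag_id="hourly_weather_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load", python_callable=load_raw_to_bigquery)
    transform = PythonOperator(task_id="transform", python_callable=transform_in_bigquery)

    extract >> load >> transform
```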

### Strengths of the solution
1. It uses state-of-the-art tools like Docker and Airflow to execute and orchestrate the pipeline virtually anywhere
2. Modularized, atomic and well-documented code

### Weaknesses
1. It doesn't use external storage, and it should.
2. Some functions are not optimal, with some file paths hardcoded for practical purposes.
3. I didn't include tests, but a production environment should have them (see the sketch below).
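
As a sketch of what such a test could look like, here is a hypothetical example; the `validate_schema` helper is defined inline purely for illustration and is not a function from this repository:

```python
# Hypothetical pytest sketch: `validate_schema` is defined here for illustration
# only and does not correspond to code in this repository.
import pandas as pd
import pytest


def validate_schema(df: pd.DataFrame, required_columns: list[str]) -> None:
    """Raise ValueError if any required column is missing from the raw table."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")


def test_validate_schema_rejects_missing_column():
    df = pd.DataFrame({"station_id": ["MX01"], "temperature": [21.5]})
    with pytest.raises(ValueError):
        validate_schema(df, required_columns=["station_id", "temperature", "observed_at"])
```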

### Improvements for future versions, and how to scale, organize and automate the solution:
1. Use cloud-managed Airflow
2. Include a CI/CD/QA/DQ flow
3. Use external cloud storage wherever possible
Empty file removed: airflow/data/raw/.gitkeep
response.json: 1 change (0 additions, 1 deletion)

This file was deleted.
