SpicyBytes is a food and grocery management platform that aims to reduce food waste by providing a sustainable way to buy and sell groceries nearing their expiration date.
```
├───.github
│   └───workflows
├───dags
│   ├───allminogcs.py
│   ├───collector.py
│   ├───etl_exploitation_zone.py
│   ├───etl_formatted_zone.py
│   ├───expiry_notification.py
│   └───synthetic.py
├───data
│   └───raw
├───landing_zone
│   ├───collectors
│   │   ├───approved_food_uk
│   │   │   └───approvedfood_scraper
│   │   │       └───approvedfood_scraper
│   │   ├───big_basket
│   │   ├───catalonia_establishment_location
│   │   ├───customers
│   │   ├───eat_by_date
│   │   ├───flipkart
│   │   │   └───JSON_files
│   │   ├───meal_db
│   │   │   └───mealscraper
│   │   │       └───mealscraper
│   │   └───OCR
│   │       ├───images
│   │       └───output
│   └───synthetic
│       ├───customer_location
│       ├───customer_purchase
│       ├───sentiment_reviews
│       └───supermarket_products
├───formatted_zone
│   ├───business_review_sentiment.py
│   ├───customer_location.py
│   ├───customer_purchase.py
│   ├───customer_sales.py
│   ├───customers.py
│   ├───dynamic_pricing.py
│   ├───establishments_catalonia.py
│   ├───estimate_expiry_date.py
│   ├───estimate_perishability.py
│   ├───expiry_notification.py
│   ├───individual_review_sentiment.py
│   ├───location.py
│   └───mealdrecomend.py
├───exploitation_zone
│   ├───dim_cust_location.py
│   ├───dim_date.py
│   ├───dim_product.py
│   ├───dim_sp_location.py
│   ├───fact_business_cust_purchase.py
│   ├───fact_business_inventory.py
│   ├───fact_business_review.py
│   ├───fact_cust_inventory.py
│   ├───fact_cust_purchase.py
│   ├───fact_customer_review.py
│   └───schema.txt
└───readme_info
```
The `data` folder stores the raw data scraped by the scripts in the `landing_zone`. The `landing_zone` contains two types of data-generation scripts:

- `collectors` holds scripts for data sources that are either scraped or retrieved through API requests from the corresponding webpages.
- `synthetic` holds scripts that generate data synthetically, combining the collected data sources with fake data produced by the Python Faker library (a minimal sketch follows).
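For illustration, here is a minimal sketch of how synthetic records could be produced with Faker. The field names and output file are assumptions for the example, not the exact schema used by the scripts in `landing_zone/synthetic`:

```python
# Hypothetical sketch: generate fake customer-purchase rows with Faker.
# Field names and the output path are illustrative, not the project's actual schema.
import csv
import random
from faker import Faker

fake = Faker()

rows = [
    {
        "customer_id": fake.uuid4(),
        "customer_name": fake.name(),
        "purchase_date": fake.date_between(start_date="-30d", end_date="today").isoformat(),
        "product": random.choice(["milk", "bread", "yogurt", "apples"]),
        "quantity": random.randint(1, 5),
    }
    for _ in range(100)
]

with open("customer_purchase_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```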
- To execute the program, clone the repository.
- Install the requirements:

  ```
  pip install -r requirements.txt
  ```

- Configure Airflow: set up your Airflow environment (executor, database, authentication method, etc.). Refer to the Airflow documentation for detailed configuration instructions.
- Verify that Apache Airflow is installed on your local machine and is running.
- Start the Airflow webserver and scheduler:

  ```
  airflow webserver --port 8080
  airflow scheduler
  ```

- Access the Airflow UI: open your web browser and navigate to http://localhost:8080.
- Enable your DAGs.
The `collector.py` DAG collects data on a monthly schedule, while the `synthetic.py` DAG generates synthetic data daily; a sketch of this scheduling pattern is shown below.
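As a rough illustration of how such schedules might be declared (the DAG ID, start date, and task callable here are assumptions; the actual definitions live in the files under `dags/`):

```python
# Hypothetical sketch of the scheduling pattern; not the project's actual DAG code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_collectors():
    # Placeholder for the scraping / API-collection logic in landing_zone/collectors.
    pass


with DAG(
    dag_id="collector",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",  # the synthetic DAG would use "@daily" instead
    catchup=False,
) as collector_dag:
    PythonOperator(task_id="collect_raw_data", python_callable=run_collectors)
```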
The proposed high-level architecture is the one employed for the P1 delivery.
We have created several DAGs to manage the workflows within the following zones:
- Formatted Zone
  - Manages the tasks related to data formatting and standardization.
  - Sends the formatted files to Google Cloud Storage.
- Exploitation Zone
  - Handles data exploitation, including analysis and transformation tasks.
  - Sends data to BigQuery and connects to Google Looker for further analysis and visualization (a minimal sketch of the GCS and BigQuery hand-offs follows this list).
- Landing Zone
  - Manages the initial data landing, ingestion, and raw data handling.
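As a hedged illustration of those two hand-offs, the sketch below uploads a formatted file to Google Cloud Storage and loads it into BigQuery using the official client libraries. The bucket, dataset, table, and file names are placeholders, not the project's actual configuration:

```python
# Illustrative sketch only: bucket, dataset, table, and file names are placeholders.
from google.cloud import bigquery, storage


def upload_formatted_file(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload a formatted file to Google Cloud Storage and return its gs:// URI."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


def load_into_bigquery(gcs_uri: str, table_id: str) -> None:
    """Load a Parquet file from GCS into a BigQuery table."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
    client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()


if __name__ == "__main__":
    uri = upload_formatted_file(
        "customer_purchase.parquet",          # local formatted file (placeholder)
        "spicybytes-formatted-zone",          # GCS bucket name (placeholder)
        "customer_purchase.parquet",
    )
    load_into_bigquery(uri, "my-project.exploitation_zone.fact_cust_purchase")
```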
- Setting Up: Ensure all dependencies are installed and the environment is configured properly.
- Executing DAGs: The DAGs can be executed via the Airflow scheduler. Ensure the Airflow server is running and the DAGs are enabled in the Airflow UI.
- Monitoring: Monitor the execution of the DAGs through the Airflow UI for any errors or required interventions.