Capstone project created for Promineo Tech's Data Engineering program. Utilized AWS tools and services to build a pipeline that moves, cleans, stores, and analyzes data scraped and generated with Python.
Mapped out the steps in the business process
Created visual models using Visual Paradigm
Created MS SQL Server database instance in AWS RDS
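The instance can be provisioned from the console or scripted with Boto3. A minimal sketch, assuming hypothetical names, region, and the SQL Server Express engine:

```python
import boto3

# Hypothetical identifiers; "sqlserver-ex" is the SQL Server Express edition.
rds = boto3.client("rds", region_name="us-east-1")
rds.create_db_instance(
    DBInstanceIdentifier="capstone-db",
    DBInstanceClass="db.t3.micro",
    Engine="sqlserver-ex",
    AllocatedStorage=20,
    MasterUsername="admin",
    MasterUserPassword="change-me",  # use Secrets Manager in practice
    PubliclyAccessible=True,         # needed so DBeaver can reach it
)
```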
Connected to RDS instance using DBeaver
Used DDL script to create tables within database instance
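The DDL was run against the instance (via the DBeaver connection above), but the same kind of statement could also be executed programmatically. A sketch with pyodbc, using a hypothetical endpoint, credentials, and table; the real script defines the project's own tables:

```python
import pyodbc

# Hypothetical endpoint and credentials for the RDS SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=capstone-db.xxxxxxxx.us-east-1.rds.amazonaws.com,1433;"
    "DATABASE=capstone;UID=admin;PWD=change-me"
)
# Example table only; the actual DDL script creates the project's schema.
conn.execute("""
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        NVARCHAR(100),
        email       NVARCHAR(255)
    )
""")
conn.commit()
conn.close()
```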
Utilized BeautifulSoup, Faker, and the ChatGPT API to generate data, then stored it in CSV files using Boto3
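A minimal sketch of the generate-and-upload step, using Faker for synthetic rows and Boto3 for the upload; the bucket name and columns are hypothetical, and the scraped and ChatGPT-generated data followed the same path:

```python
import csv
import io

import boto3
from faker import Faker

fake = Faker()

# Build a CSV of synthetic customer rows in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["customer_id", "name", "email"])
for i in range(100):
    writer.writerow([i, fake.name(), fake.email()])

# Hypothetical bucket; upload the CSV with Boto3.
s3 = boto3.client("s3")
s3.put_object(Bucket="capstone-raw-data", Key="customers.csv", Body=buf.getvalue())
```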
Employed the DBeaver Import Wizard to load the CSV files into their respective tables
Created a data lake using AWS Lake Formation, an Amazon S3 bucket, and a Glue Data Catalog hosted on the administrator account
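The bucket, catalog database, and Lake Formation registration can all be scripted with Boto3. A sketch with hypothetical names, run under the administrator account:

```python
import boto3

# Hypothetical bucket; assumes us-east-1 (other regions need CreateBucketConfiguration).
s3 = boto3.client("s3")
s3.create_bucket(Bucket="capstone-data-lake")

# Hypothetical catalog database for the lake's tables.
glue = boto3.client("glue")
glue.create_database(DatabaseInput={"Name": "capstone_lake"})

# Register the bucket with Lake Formation so it governs access to the location.
lf = boto3.client("lakeformation")
lf.register_resource(
    ResourceArn="arn:aws:s3:::capstone-data-lake",
    UseServiceLinkedRole=True,
)
```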
Used Glue ETL job to transfer raw data from RDS instance to S3 data lake, populating the Glue catalog
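A sketch of what such a job script can look like; it runs inside Glue (the awsglue modules are not available locally), and the database, table, and path names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw table from the Glue catalog (backed by the RDS connection).
raw = glueContext.create_dynamic_frame.from_catalog(
    database="capstone_raw", table_name="customers"
)

# Write it to the S3 data lake as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://capstone-data-lake/raw/customers/"},
    format="parquet",
)
```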
Inspected data in Athena to check the data quality
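Quality checks can be simple aggregate queries. A sketch using the Athena API via Boto3, with a hypothetical table, column, and results bucket:

```python
import boto3

athena = boto3.client("athena")

# Count rows with missing emails as one example of a quality check.
athena.start_query_execution(
    QueryString="""
        SELECT COUNT(*) AS missing_emails
        FROM customers
        WHERE email IS NULL OR email = ''
    """,
    QueryExecutionContext={"Database": "capstone_lake"},
    ResultConfiguration={"OutputLocation": "s3://capstone-athena-results/"},
)
```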
Used Glue ETL job to correct data issues and move the clean data into the data lake
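Continuing the earlier job sketch, a hypothetical cleaning pass using Glue's built-in transforms before writing to a clean prefix:

```python
from awsglue.transforms import DropNullFields, Filter

# Drop columns that are entirely null, then filter out rows with blank emails.
cleaned = DropNullFields.apply(frame=raw)
cleaned = Filter.apply(frame=cleaned, f=lambda row: row["email"] not in (None, ""))

glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://capstone-data-lake/clean/customers/"},
    format="parquet",
)
```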
Created a Glue crawler to register the updated data in the Glue catalog
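The crawler can also be created and started with Boto3. A sketch with hypothetical names; the role needs Glue and S3 permissions:

```python
import boto3

glue = boto3.client("glue")

# Crawl the clean prefix and write table definitions into the catalog database.
glue.create_crawler(
    Name="capstone-clean-crawler",
    Role="GlueServiceRole",
    DatabaseName="capstone_lake",
    Targets={"S3Targets": [{"Path": "s3://capstone-data-lake/clean/"}]},
)
glue.start_crawler(Name="capstone-clean-crawler")
```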
Set up AWS Redshift cluster
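A sketch of the cluster setup with Boto3, using hypothetical identifiers; a single-node cluster is enough at capstone scale:

```python
import boto3

redshift = boto3.client("redshift")

# Hypothetical names and credentials; use Secrets Manager in practice.
redshift.create_cluster(
    ClusterIdentifier="capstone-cluster",
    NodeType="dc2.large",
    ClusterType="single-node",
    DBName="capstone",
    MasterUsername="admin",
    MasterUserPassword="Change-me-1",
)
```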
Imported the updated Glue catalog into Redshift, creating and populating the cluster schema
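One way to expose a Glue catalog to Redshift is an external schema (Redshift Spectrum). A sketch via the Redshift Data API, with hypothetical cluster, database, and IAM role values:

```python
import boto3

rsd = boto3.client("redshift-data")

# Map the Glue catalog database into Redshift as an external schema.
rsd.execute_statement(
    ClusterIdentifier="capstone-cluster",
    Database="capstone",
    DbUser="admin",
    Sql="""
        CREATE EXTERNAL SCHEMA lake
        FROM DATA CATALOG
        DATABASE 'capstone_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    """,
)
```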
Used Visual Paradigm to visualize data from Redshift cluster