- Program Introduction: projects, pre-requisites, instructors, careers team, getting help..
- Introduction to Data Engineering
- What do data engineers do?
- NoSQL Database: Apache Cassandra
- OLAP vs. OLTP
- Normalization, Denormalization
- Fact and Dimension tables
- Star and Snowflake Schema
- Project 1: Data Modeling with Postgres Details
This project builds a star schema in Postgres with fact and dimension tables for analytics. A Python ETL pipeline transfers data from local JSON files into these tables. Apache Cassandra exercises: create tables, primary key, clustering column, where clause
- Apache Cassandra exercises 1-4: create tables, primary key, clustering column, where clause
- Project 2: Data Modeling with Apache Cassandra Details
This project involves designing and implementing an ETL pipeline to analyze music streaming data using Apache Cassandra. The goal is to transform raw event data stored in CSV files into a structured database optimized for specific query patterns.
- Data Warehouse
- Primary key, partition key, composite key, and clustering key
- Dimensional Modeling: Fact & Dimensions
- Exercise 1: 3NF to Star Shema
This exercise effectively demonstrates the practical process of converting a database from Third Normal Form (3NF) to a Star Schema. The key takeaway is the strategic denormalization of tables to optimize for analytical querying. By creating a central fact table (sales_fact) surrounded by descriptive dimension tables (dim_customer, dim_product, etc.), the resulting schema simplifies data retrieval and enhances performance for business intelligence tasks like sales analysis. This transformation highlights the fundamental difference between a schema designed for transactional efficiency (3NF) and one designed for analytical speed and simplicity (Star Schema).
- DWH Architecture: Kimball’s Bus Architecture, Independent Data Marts, Inmon’s Corporate Information Factory (CIF), Hybrid Kimball Bus & Inmon CIF
- Data marts
- OLAP Cubes and exercises: Roll-up, Drill-down, Slice, Dice, query optimization
- OLAP Cubes Technologies: MOLAP, ROLAP
- Exercise 3: Column format in ROLAP
- Create an IAM Role, Security Group, an IAM User, Bucket
- Lunch and delete a Redshift Cluster
- Create PostgreSQL Database
- Redshift technology, Architecture, ETL
- Exercise 1: Launch Redshift cluster
- Infrastructure as Code (IaC)
- Exercise 2: Creating Redshift Cluster using the AWS python SDK
- Exercise 3 Parallel ETL
- Optimizing Table Design
- Distribution Style: Even, All, Auto, Key
- Sorting Key
- Exercise 4: Table Design
- Project: Data Warehouse deatails
This project designs and implements a cloud-based Data Warehouse on Amazon Redshift. It demonstrates an end-to-end ETL pipeline that extracts song and user activity data from files stored in an Amazon S3 bucket, stages and processes it in Redshift, and transforms it into a set of optimized dimensional tables (a star schema).