This repository follows a data engineering bootcamp. As I progress through the course, I add my notes, approach, and solutions.
- Docker, Docker Hub
- Terraform
- Python
- Git, GitHub, GitHub Codespaces
- Google Cloud
More...
- Course overview
- Introduction to GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment for the course
- Homework
- Data Lake
- Workflow orchestration
- Introduction to Prefect
- ETL with GCP & Prefect
- Parametrizing workflows
- Prefect Cloud and additional resources
- Homework
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- Integrating BigQuery with Prefect and Airflow
- BigQuery Machine Learning
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
- Batch data processing
- What is Spark
- Spark Dataframes
- Spark SQL
- Internals: GroupBy and joins
- Kafka Actors
- Topic
- Consumer
- Producer
- Streams vs State
- Aggregates
- Streaming with Spark