Welcome to my repository for DeepLearning.AI's Data Engineering Professional Certificate! This repo contains code, quizzes, and personal notes from the specialization, documenting my journey through data engineering concepts and tools.

The Data Engineering Specialization is a comprehensive program designed to equip learners with the skills to design, build, and manage data pipelines and architectures. This repository documents my hands-on experience with the course material, organized below by the program's four courses.
**Course 1: Introduction to Data Engineering**

- Key Topics:
  - Data engineering lifecycle and undercurrents
  - Designing data architectures on AWS
  - Implementing batch and streaming pipelines
- Content:
  - Notes on requirements gathering and stakeholder collaboration
  - Code samples for batch and streaming pipelines (see the sketch below)
  - Architecture diagrams and design considerations
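To give a flavor of the batch pipelines covered here, the following is a minimal extract-transform-load sketch in plain Python. It is not taken from the course materials; the file names and the `status`/`amount` fields are made up for illustration.

```python
import csv
import json

def extract(path):
    """Read raw records from a CSV file (hypothetical input)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Keep completed orders and cast the amount field to a number."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in records
        if r.get("status") == "completed"
    ]

def load(records, path):
    """Write the cleaned records out as JSON Lines."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "orders_clean.jsonl")
```

Real pipelines add scheduling, retries, and monitoring on top of this basic shape; those undercurrents are exactly what the later courses cover.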
**Course 2: Source Systems, Data Ingestion, and Pipelines**

- Key Topics:
  - Working with source systems (relational and NoSQL databases)
  - Data ingestion techniques (batch and streaming)
  - DataOps practices (CI/CD, Infrastructure as Code, data quality)
- Content:
  - Scripts for data ingestion from APIs and message queues
  - Terraform configurations for AWS resources
  - Airflow DAGs for orchestrating data pipelines (see the first sketch below)
  - Data quality tests using Great Expectations (see the second sketch below)
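As a sketch of what the orchestration code looks like, here is a small Airflow DAG using the TaskFlow API (assuming Airflow 2.x). The endpoint URL and field names are placeholders, not the course's actual sources.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ingest_orders():
    """Hypothetical daily pipeline: pull records from an API, then check them."""

    @task
    def ingest():
        import requests  # imported inside the task so it resolves on the worker
        resp = requests.get("https://api.example.com/orders")  # placeholder URL
        resp.raise_for_status()
        return resp.json()

    @task
    def validate(records):
        # Minimal hand-rolled check; the real quality tests use Great Expectations.
        assert all("amount" in r for r in records), "record missing amount field"
        return len(records)

    validate(ingest())

ingest_orders()
```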
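And a minimal data quality check with Great Expectations, assuming the classic pandas-backed API (the library's interface has changed substantially across versions, so treat this as a sketch). The sample data is made up.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical sample of the ingested data.
df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
}))

# Each expectation validates immediately and returns a result object.
checks = [
    df.expect_column_values_to_not_be_null("order_id"),
    df.expect_column_values_to_be_between("amount", min_value=0),
]
assert all(c.success for c in checks), "data quality check failed"
```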
**Course 3: Data Storage and Queries**

- Key Topics:
  - Storage systems (object, block, and file storage)
  - Data lake and data warehouse architectures
  - Query optimization and performance tuning
- Content:
  - Implementations of data lakehouse architectures
  - Advanced SQL queries and performance comparisons
  - Notes on storage formats and indexing strategies (see the sketch below)
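One storage-format point from my notes, illustrated with a small self-contained experiment (not from the course materials): columnar formats like Parquet let a query read only the columns it needs, while row-oriented CSV must parse every row in full. Writing Parquet with pandas requires pyarrow (or fastparquet).

```python
import time

import numpy as np
import pandas as pd

# A hypothetical million-row table with a few typed columns.
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "amount": np.random.rand(1_000_000),
    "region": np.random.choice(["us", "eu", "apac"], 1_000_000),
})

df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")  # columnar; needs pyarrow or fastparquet

# Reading a single column: Parquet skips the other columns entirely,
# while CSV still scans the whole file.
t0 = time.perf_counter()
pd.read_csv("events.csv", usecols=["amount"])
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
pd.read_parquet("events.parquet", columns=["amount"])
t_parquet = time.perf_counter() - t0

print(f"CSV: {t_csv:.2f}s  Parquet: {t_parquet:.2f}s")
```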
**Course 4: Data Modeling, Transformation, and Serving**

- Key Topics:
  - Data modeling techniques (normalization, star schema, data vault)
  - Transformations for analytics and machine learning
  - Batch and streaming data processing
- Content:
  - Data models and schemas for different use cases
  - PySpark code for data transformations (see the sketch below)
  - Preprocessing pipelines for machine learning datasets
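A representative PySpark transformation, shaped like the filter-derive-aggregate jobs in this course. The input path and the column names (`status`, `created_at`, `amount`, `region`) are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Hypothetical raw orders table; in practice this might live on S3.
orders = spark.read.parquet("data/raw/orders/")

# A typical analytics transformation: filter, derive a column, aggregate.
daily_revenue = (
    orders
    .filter(F.col("status") == "completed")
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("data/marts/daily_revenue/")
```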
**Skills Demonstrated**

- Data Architecture Design
- Data Ingestion Techniques
- DataOps Practices
- Data Storage and Retrieval
- Data Modeling
- Data Transformation and Orchestration
**Technologies Used**

- Programming Languages: Python, SQL
- Cloud Platforms: AWS
- Data Processing Frameworks: Apache Spark, PySpark, Pandas
- Orchestration Tools: Apache Airflow
- Infrastructure as Code: Terraform
- Data Quality Tools: Great Expectations
- Databases: MySQL, PostgreSQL, MongoDB
- Object Storage: Amazon S3
- Others: REST APIs, Message Queues, Streaming Platforms
This project is licensed under the MIT License - see the LICENSE file for details.
Feel free to reach out via LinkedIn or email for any questions or collaborations!