# Data Ingestion Pipelines

This project demonstrates the implementation of two different data ingestion approaches: batch data migration using Apache Sqoop and real-time data ingestion using Apache Flume and Apache Kafka.

## Task 1: Batch Data Migration with Apache Sqoop

This task focuses on migrating historical data from a local relational database to Hadoop's HDFS using Apache Sqoop. Periodic incremental imports then keep the HDFS copy in sync as new data is added to the source.

### Technologies Used

- Relational database (MariaDB)
- Apache Hadoop (HDFS)
- Apache Sqoop

### Steps

- Set up the relational database and create a sample table with historical data (a sample schema is sketched after this list).
- Configure Apache Hadoop and HDFS on your system.
- Install and configure Apache Sqoop to connect to the relational database and HDFS.
- Perform the initial full data import from the database to HDFS using Sqoop (example command after this list).
- Implement a script or job to perform periodic incremental imports, capturing new rows added to the database (see the saved-job sketch below).
- Verify the data in HDFS and compare it with the source database (verification commands below).
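
The Sqoop examples below assume a hypothetical source table; here is a minimal MariaDB schema of that shape. All database, table, and column names (`sales_db`, `transactions`, `id`, ...) are placeholders, not part of the project.

```sql
-- Hypothetical sample table with some historical rows (all names are placeholders)
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

CREATE TABLE transactions (
    id INT AUTO_INCREMENT PRIMARY KEY,   -- monotonically increasing key, reused below for incremental imports
    customer_name VARCHAR(100),
    amount DECIMAL(10, 2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO transactions (customer_name, amount) VALUES
    ('Alice', 120.50),
    ('Bob', 75.00);
```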
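
A sketch of the initial full import, assuming the placeholder `sales_db.transactions` table above, a local MariaDB instance reachable through the MySQL JDBC driver (the connector jar must be on Sqoop's classpath), and an HDFS target directory of your choosing:

```bash
# One-off full import of the transactions table into HDFS (URL, user, and paths are placeholders)
sqoop import \
  --connect jdbc:mysql://localhost:3306/sales_db \
  --username sqoop_user \
  -P \
  --table transactions \
  --target-dir /user/hadoop/transactions \
  --num-mappers 1
```

`-P` prompts for the database password interactively; a single mapper keeps the demo simple and avoids the need for a split column.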
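
One way to handle the periodic incremental imports (a judgment call, not prescribed by the project) is a saved Sqoop job that appends rows whose `id` is greater than the last value imported; Sqoop records the updated `--last-value` in its metastore after each run, so the job can simply be re-executed from cron or another scheduler. The job name, credentials file, and paths below are placeholders.

```bash
# Create a reusable incremental-import job (run once)
sqoop job --create transactions_incremental -- import \
  --connect jdbc:mysql://localhost:3306/sales_db \
  --username sqoop_user \
  --password-file file:///home/hadoop/.db_password \
  --table transactions \
  --target-dir /user/hadoop/transactions \
  --incremental append \
  --check-column id \
  --last-value 0 \
  --num-mappers 1

# Run it periodically, e.g. from a cron entry
sqoop job --exec transactions_incremental
```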
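
A rough consistency check, using the same placeholder paths and table: compare what landed in HDFS with the row count in the source database.

```bash
# Inspect the imported files and count the records written by Sqoop
hdfs dfs -ls /user/hadoop/transactions
hdfs dfs -cat /user/hadoop/transactions/part-m-* | wc -l

# Compare with the row count in the source table
mysql -u sqoop_user -p -e "SELECT COUNT(*) FROM sales_db.transactions;"
```

Line counting is only a sanity check; for stricter validation, spot-check individual records or checksum selected columns on both sides.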

### Outcomes

- Migrated historical data from the relational database to HDFS using Apache Sqoop.
- Implemented periodic incremental data imports to keep the HDFS data up-to-date.
- Validated the data integrity between the source database and the HDFS destination.

## Task 2: Real-Time Data Ingestion with Apache Flume and Apache Kafka

This task focuses on setting up a real-time data ingestion pipeline using Apache Flume and Apache Kafka. Log data is collected from a local directory and streamed to a Kafka topic for real-time processing.

### Technologies Used

- Apache Hadoop (HDFS)
- Apache Flume
- Apache Kafka
- Python

### Steps

- Set up Apache Hadoop and HDFS on your system.
- Write a Python script to generate log file data (example script after this list).
- Install and configure Apache Flume to collect log data from a local directory.
- Set up Apache Kafka and create a Kafka topic to receive the log data (topic creation command below).
- Configure Flume to use Kafka as the destination for the collected log data (see the sample agent configuration after this list).
- Test the real-time data ingestion pipeline by generating sample log data in the local directory and verifying that it arrives in the Kafka topic (console consumer command below).
- Explore options for consuming the data from the Kafka topic for real-time processing or further downstream analysis (a minimal Python consumer is sketched below).
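
A minimal log generator for the scripting step above; the directories, file naming, and log format are placeholders. Files are written to a staging directory and then moved into the watched directory, assuming the spooling-directory source used in the agent configuration sketched further down, which expects files to be complete once they appear.

```python
import os
import random
import time
from datetime import datetime

# Placeholder directories; SPOOL_DIR must match spoolDir in the Flume agent config.
STAGING_DIR = "/tmp/flume-staging"
SPOOL_DIR = "/tmp/flume-spool"
LEVELS = ["INFO", "WARN", "ERROR"]

os.makedirs(STAGING_DIR, exist_ok=True)
os.makedirs(SPOOL_DIR, exist_ok=True)

while True:
    filename = datetime.now().strftime("app-%Y%m%d-%H%M%S.log")
    staged = os.path.join(STAGING_DIR, filename)

    # Write a small batch of fake log lines to the staging area
    with open(staged, "w") as f:
        for _ in range(10):
            level = random.choice(LEVELS)
            f.write(f"{datetime.now().isoformat()} {level} sample event {random.randint(1, 1000)}\n")

    # Move the finished file into the directory Flume is watching
    os.rename(staged, os.path.join(SPOOL_DIR, filename))
    time.sleep(5)
```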
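
Creating the topic, assuming a single-broker Kafka installation on `localhost:9092` and a placeholder topic name of `app-logs` (Kafka 2.2+ syntax; older releases use `--zookeeper` instead of `--bootstrap-server`):

```bash
# Create a single-partition topic for the log events (topic name is a placeholder)
kafka-topics.sh --create \
  --topic app-logs \
  --bootstrap-server localhost:9092 \
  --partitions 1 \
  --replication-factor 1
```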
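
A sketch of a Flume agent configuration tying the pieces together: a spooling-directory source watching the directory the generator moves files into, a memory channel, and a Kafka sink publishing to the `app-logs` topic. Agent and component names are arbitrary, and the Kafka sink properties assume Flume 1.7 or later.

```properties
# flume-kafka.conf -- single agent: spooldir source -> memory channel -> Kafka sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Watch the directory the log generator moves completed files into
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /tmp/flume-spool
a1.sources.r1.channels = c1

# Buffer events in memory (fine for a demo; a file channel is more durable)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Publish each log line as a record on the Kafka topic created earlier
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.channel = c1
```

The agent can then be started with something like `flume-ng agent --name a1 --conf-file flume-kafka.conf --conf $FLUME_HOME/conf`.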
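
To test the pipeline end to end, run the generator script, start the Flume agent, and watch the topic with Kafka's console consumer (same placeholder broker and topic as above):

```bash
# Print everything published to the topic so far, then keep following it
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic app-logs \
  --from-beginning
```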
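
For downstream processing, one option is a small Python consumer. This sketch assumes the third-party `kafka-python` package (`pip install kafka-python`) and the same placeholder broker and topic; any other Kafka client (Spark Structured Streaming, Kafka Streams, etc.) would work equally well.

```python
from kafka import KafkaConsumer  # third-party package: kafka-python

# Subscribe to the log topic and process each event as it arrives
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",              # start from the oldest available record
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    # Placeholder "processing": flag ERROR events; replace with real downstream logic
    if "ERROR" in message.value:
        print(f"error event at offset {message.offset}: {message.value}")
```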

### Outcomes

- Set up a real-time data ingestion pipeline using Apache Flume and Apache Kafka.
- Collected log data from a local directory and streamed it to a Kafka topic in real-time.
- Demonstrated the ability to consume the data from the Kafka topic for real-time processing or further analysis.

## Conclusion

This project showcases two distinct data ingestion approaches: batch data migration using Apache Sqoop and real-time data ingestion using Apache Flume and Apache Kafka. Completing both tasks provides hands-on experience in setting up and managing data ingestion pipelines, which are essential for building robust and scalable data processing systems.