Develop a streaming data pipeline that retrieves English-language data from an API and produces it in real time to an Apache Kafka topic. Then build a Spark Streaming application that consumes the records from Kafka and counts the number of words in each record in real time.
Apache Kafka | Confluent | Python | AWS S3
This repository contains a real-time data analytics pipeline built using Apache Kafka.
- Apache Kafka: A distributed event streaming platform used for building real-time data pipelines and streaming applications.
- Virtualenv: A tool to create isolated Python environments.
- Confluent Cloud: A fully managed cloud service for Apache Kafka provided by Confluent.
Before running the Kafka producers and consumers, you need to set up your Confluent environment.
- Create a virtual environment and activate it:

  ```bash
  virtualenv env
  source env/bin/activate
  ```
- Install the Confluent Kafka Python client:

  ```bash
  pip install confluent-kafka
  ```
- Configure your `file.ini` with your Confluent Cloud API keys and cluster settings:

  ```ini
  [default]
  bootstrap.servers=<BOOTSTRAP SERVER>
  security.protocol=SASL_SSL
  sasl.mechanisms=PLAIN
  sasl.username=<CLUSTER API KEY>
  sasl.password=<CLUSTER API SECRET>

  [consumer]
  group.id=python_example_group_1
  auto.offset.reset=earliest
  ```
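Both scripts read this file at startup. For orientation, here is a minimal sketch of what the producing side looks like with confluent-kafka; the topic name `api_data` and the payload are placeholders (the actual `producer.py` in this repository streams records fetched from the API):

```python
#!/usr/bin/env python
# Minimal producer sketch. Assumptions: the topic name "api_data" and the
# sample payload are placeholders; the real producer.py fetches API data.
import sys
from configparser import ConfigParser
from confluent_kafka import Producer

# Load the [default] section of the ini file passed on the command line.
config_parser = ConfigParser()
config_parser.read(sys.argv[1])
config = dict(config_parser["default"])

producer = Producer(config)

def delivery_report(err, msg):
    # Called once per message to report delivery success or failure.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

# Produce one placeholder record; a real run would loop over API responses.
producer.produce("api_data", value="hello streaming world", callback=delivery_report)

# Block until all queued messages are delivered.
producer.flush()
```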
To run the Kafka producers and consumers, follow these steps:
- Make the producer script executable:

  ```bash
  chmod u+x producer.py
  ```

- Run the producer script with your `file.ini`:

  ```bash
  ./producer.py file.ini
  ```

- Make the consumer script executable:

  ```bash
  chmod u+x consumer.py
  ```

- Run the consumer script with your `file.ini`:

  ```bash
  ./consumer.py file.ini
  ```
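For reference, a minimal sketch of the consuming side with confluent-kafka, again assuming the hypothetical `api_data` topic; the `[consumer]` section of `file.ini` supplies the group id and offset reset policy:

```python
#!/usr/bin/env python
# Minimal consumer sketch. Assumption: the topic name "api_data" is a
# placeholder; substitute the topic your producer writes to.
import sys
from configparser import ConfigParser
from confluent_kafka import Consumer

# Merge the [default] connection settings with the [consumer] group settings.
config_parser = ConfigParser()
config_parser.read(sys.argv[1])
config = dict(config_parser["default"])
config.update(config_parser["consumer"])

consumer = Consumer(config)
consumer.subscribe(["api_data"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 s for a record
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"Consumed: {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # commit final offsets and leave the group
```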
Make sure to run the producer and consumer in separate terminal (e.g., Git Bash) instances.
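The per-record word counting described at the top happens in the Spark Streaming application. Below is a minimal Structured Streaming sketch, once more assuming the hypothetical `api_data` topic; when connecting to Confluent Cloud, the SASL settings from `file.ini` would additionally need to be passed as `kafka.`-prefixed options, and the job must be submitted with the Kafka connector package matching your Spark version (e.g. `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0`).

```python
# Minimal Spark Structured Streaming sketch that counts the words in each
# Kafka record. Assumptions: the topic "api_data" and the bootstrap server
# placeholder; reuse the values from your file.ini.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size, split

spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
records = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP SERVER>")
    .option("subscribe", "api_data")
    .load()
)

# Kafka values arrive as bytes: cast to string, split on whitespace,
# and count the resulting tokens per record.
word_counts = records.select(
    col("value").cast("string").alias("record"),
    size(split(col("value").cast("string"), r"\s+")).alias("word_count"),
)

# Print each record and its word count to the console as it arrives.
query = word_counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```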
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to the Apache Kafka and Confluent communities for their excellent tools and documentation.