Develop a streaming data pipeline that retrieves English-language data from an API and produces it in real time to an Apache Kafka topic. Then build a Spark Streaming application that consumes the records from Kafka and counts the number of words in each record in real time.
Apache Kafka | Confluent | Python | AWS S3
This repository contains a real-time data analytics pipeline built using Apache Kafka.
- Apache Kafka: A distributed event streaming platform used for building real-time data pipelines and streaming applications.
- Virtualenv: A tool to create isolated Python environments.
- Confluent Cloud: A fully managed cloud service for Apache Kafka provided by Confluent.
Before running the Kafka producers and consumers, you need to set up your Confluent environment.
- Create a virtual environment and activate it:

  ```bash
  virtualenv env
  source env/bin/activate
  ```
- Install the Confluent Kafka Python client:

  ```bash
  pip install confluent-kafka
  ```
- Configure your `file.ini` with your Confluent Cloud API keys and cluster settings:

  ```ini
  [default]
  bootstrap.servers=<BOOTSTRAP SERVER>
  security.protocol=SASL_SSL
  sasl.mechanisms=PLAIN
  sasl.username=<CLUSTER API KEY>
  sasl.password=<CLUSTER API SECRET>

  [consumer]
  group.id=python_example_group_1
  auto.offset.reset=earliest
  ```
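Both scripts read this file at startup. For orientation, here is a minimal sketch of what the producing side looks like with confluent-kafka; the topic name `api_data` and the payload are placeholders (the actual `producer.py` in this repository streams records fetched from the API):

```python
#!/usr/bin/env python
# Minimal producer sketch. Assumptions: the topic name "api_data" and the
# sample payload are placeholders; the real producer.py fetches API data.
import sys
from configparser import ConfigParser
from confluent_kafka import Producer

# Load the [default] section of the ini file passed on the command line.
config_parser = ConfigParser()
config_parser.read(sys.argv[1])
config = dict(config_parser["default"])

producer = Producer(config)

def delivery_report(err, msg):
    # Called once per message to report delivery success or failure.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

# Produce one placeholder record; a real run would loop over API responses.
producer.produce("api_data", value="hello streaming world", callback=delivery_report)

# Block until all queued messages are delivered.
producer.flush()
```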
To run the Kafka producers and consumers, follow these steps:
- Make the producer script executable:

  ```bash
  chmod u+x producer.py
  ```

- Run the producer script with your `file.ini`:

  ```bash
  ./producer.py file.ini
  ```

- Make the consumer script executable:

  ```bash
  chmod u+x consumer.py
  ```

- Run the consumer script with your `file.ini`:

  ```bash
  ./consumer.py file.ini
  ```
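For reference, a minimal sketch of the consuming side with confluent-kafka, again assuming the hypothetical `api_data` topic; the `[consumer]` section of `file.ini` supplies the group id and offset reset policy:

```python
#!/usr/bin/env python
# Minimal consumer sketch. Assumption: the topic name "api_data" is a
# placeholder; substitute the topic your producer writes to.
import sys
from configparser import ConfigParser
from confluent_kafka import Consumer

# Merge the [default] connection settings with the [consumer] group settings.
config_parser = ConfigParser()
config_parser.read(sys.argv[1])
config = dict(config_parser["default"])
config.update(config_parser["consumer"])

consumer = Consumer(config)
consumer.subscribe(["api_data"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 s for a record
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"Consumed: {msg.value().decode('utf-8')}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()  # commit final offsets and leave the group
```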
Make sure to run the producer and consumer in separate terminal (e.g., Git Bash) instances.
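The per-record word counting described at the top happens in the Spark Streaming application. Below is a minimal Structured Streaming sketch, once more assuming the hypothetical `api_data` topic; when connecting to Confluent Cloud, the SASL settings from `file.ini` would additionally need to be passed as `kafka.`-prefixed options, and the job must be submitted with the Kafka connector package matching your Spark version (e.g. `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0`).

```python
# Minimal Spark Structured Streaming sketch that counts the words in each
# Kafka record. Assumptions: the topic "api_data" and the bootstrap server
# placeholder; reuse the values from your file.ini.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size, split

spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
records = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP SERVER>")
    .option("subscribe", "api_data")
    .load()
)

# Kafka values arrive as bytes: cast to string, split on whitespace,
# and count the resulting tokens per record.
word_counts = records.select(
    col("value").cast("string").alias("record"),
    size(split(col("value").cast("string"), r"\s+")).alias("word_count"),
)

# Print each record and its word count to the console as it arrives.
query = word_counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```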
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to the Apache Kafka and Confluent communities for their excellent tools and documentation.