This project implements a real-time data processing pipeline using Apache Spark, Kafka, and Spark NLP. The pipeline ingests streaming data from Kafka, processes it with Spark NLP to clean HTML, extract metadata, and detect and redact sensitive entities, and stores the cleaned records in Parquet format for further analysis.
- Real-time Data Ingestion: Consumes streaming data from Kafka topics.
- Data Processing with Spark NLP:
  - Cleans HTML content
  - Extracts metadata (titles, paragraphs)
  - Detects Named Entities (NER)
  - Redacts sensitive information (PII)
- Batch Processing and Storage: Saves processed data in Parquet format for later use.
- Scalable Architecture: Built using Apache Spark, making it suitable for large-scale processing.
Project structure:

```
.
├── app.py              # Entry point for running the pipeline
├── config.py           # Configuration settings for Kafka & Spark
├── data_ingestion.py   # Reads data from Kafka
├── data_processing.py  # Cleans & extracts data using Spark NLP
└── pipeline.py         # Orchestrates the end-to-end data pipeline
```
Tech stack:

- Apache Spark (Structured Streaming)
- Spark NLP (Text Processing & NER)
- Apache Kafka (Streaming Data Source)
- PySpark
- Parquet (Data Storage)
Clone the repository:

```
git clone <repository_url>
cd <repository_name>
```

Ensure you have Python installed, then create a virtual environment and install the dependencies:

```
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
```
Ensure you have Kafka and Spark running. If you are using Docker, you can set up a docker-compose.yml file to start the services, as in the sketch below.
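A minimal docker-compose.yml sketch for local development, assuming Confluent Kafka/ZooKeeper and Bitnami Spark images (the image tags, ports, and single-broker settings are assumptions; adjust them to your environment). The hostnames match the defaults used in config.py:

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  spark-master:
    image: bitnami/spark:3.5
    environment:
      SPARK_MODE: master
    ports:
      - "7077:7077"
      - "8080:8080"

  spark-worker:
    image: bitnami/spark:3.5
    depends_on: [spark-master]
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
```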
Modify config.py to update settings like Kafka topics, Spark master URL, and storage paths. Example:

```python
class Config:
    KAFKA_BOOTSTRAP_SERVERS = "kafka:9092"
    SPARK_MASTER_URL = "spark://spark-master:7077"
    KAFKA_TOPIC = "data-pipeline"
    STORAGE_PATH = "./clean_data.parquet"
```
Run the pipeline:

```
python app.py
```
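Note that the Kafka connector and Spark NLP jars are not bundled with Spark itself. If app.py does not add them when building its SparkSession (for example via sparknlp.start()), one option is to submit the job with the required packages; the Maven coordinates below are the standard ones, but the version numbers are assumptions and must match your Spark and Scala build:

```
spark-submit \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0 \
  app.py
```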
How it works:

- Kafka Producer sends raw HTML content to a Kafka topic (`data-pipeline`).
- Data Ingestion Module (`data_ingestion.py`), sketched below:
  - Reads the Kafka stream.
  - Extracts relevant fields (URL, HTML content).
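A minimal sketch of the ingestion step, assuming the Kafka messages are JSON objects with `url` and `html` fields and that the module exposes a helper like `read_kafka_stream` (the message schema, helper name, and use of the Config class are assumptions; the actual data_ingestion.py may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

from config import Config

# Assumed message layout: {"url": "...", "html": "..."}
MESSAGE_SCHEMA = StructType([
    StructField("url", StringType()),
    StructField("html", StringType()),
])

def read_kafka_stream(spark: SparkSession):
    """Subscribe to the configured topic and project out the URL and HTML fields."""
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", Config.KAFKA_BOOTSTRAP_SERVERS)
        .option("subscribe", Config.KAFKA_TOPIC)
        .option("startingOffsets", "latest")
        .load()
    )
    # Kafka delivers the payload as bytes; decode it and parse the assumed JSON layout.
    return (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), MESSAGE_SCHEMA).alias("msg"))
        .select("msg.url", "msg.html")
    )
```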
- Data Processing Module (`data_processing.py`), sketched below:
  - Cleans HTML content.
  - Extracts metadata (titles, paragraphs).
  - Applies Named Entity Recognition (NER) for sensitive data detection.
  - Redacts PII.
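A sketch of the processing stage built around a standard pretrained Spark NLP NER pipeline. The regex-based HTML cleanup, the `glove_100d`/`ner_dl` models, the mask-every-entity redaction strategy, and the `process` helper name are all assumptions chosen for illustration; data_processing.py may use different models, a proper HTML parser, or entity-type-specific redaction:

```python
import re

from pyspark.ml import Pipeline
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter

# Assumption: strip tags with a regex; a real implementation might use an HTML parser instead.
@udf(StringType())
def strip_html(html):
    return re.sub(r"<[^>]+>", " ", html or "").strip()

# Replace each detected entity mention with a placeholder.
@udf(StringType())
def redact(text, chunks):
    for chunk in chunks or []:
        text = text.replace(chunk, "[REDACTED]")
    return text

def build_ner_pipeline():
    """Assemble a pretrained NER pipeline: document -> tokens -> embeddings -> NER chunks."""
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
        .setInputCols(["document", "token"]).setOutputCol("embeddings")
    ner = NerDLModel.pretrained("ner_dl") \
        .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
    chunks = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("ner_chunk")
    return Pipeline(stages=[document, tokenizer, embeddings, ner, chunks])

def process(df, spark):
    """Clean the HTML, run NER on the cleaned text, then mask every detected entity."""
    cleaned = df.withColumn("text", strip_html(col("html")))
    # Fit on a tiny static frame so the pretrained stages load once, then transform the stream.
    model = build_ner_pipeline().fit(spark.createDataFrame([("",)], ["text"]))
    annotated = model.transform(cleaned)
    return annotated.withColumn("clean_text", redact(col("text"), col("ner_chunk.result"))) \
                    .select("url", "clean_text")
```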
- Pipeline Module (`pipeline.py`), sketched below:
  - Streams processed data to a Parquet file (`clean_data.parquet`).
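A sketch of the orchestration and write step, reusing the hypothetical helpers from the previous sketches. Structured Streaming requires a checkpoint location for file sinks; the checkpoint path shown here is an assumption:

```python
from pyspark.sql import SparkSession

from config import Config
from data_ingestion import read_kafka_stream   # hypothetical helper names, as sketched above
from data_processing import process

def run_pipeline():
    spark = (
        SparkSession.builder
        .appName("data-pipeline")
        .master(Config.SPARK_MASTER_URL)
        .getOrCreate()
    )

    records = read_kafka_stream(spark)
    cleaned = process(records, spark)

    # Continuously append cleaned records to Parquet; a checkpoint dir is mandatory for file sinks.
    query = (
        cleaned.writeStream
        .format("parquet")
        .option("path", Config.STORAGE_PATH)
        .option("checkpointLocation", "./checkpoints/data-pipeline")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()

if __name__ == "__main__":
    run_pipeline()
```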
Planned improvements:

- Support for additional NLP models (e.g., sentiment analysis).
- Integration with a database for better query performance.
- UI dashboard for monitoring real-time data flow.