
Arabic News Data Engineering Project

Author: ACHRAF ER-RAHOUTI

This project implements a scalable ETL pipeline that extracts and processes Arabic news content from more than nine sources, including Al Jazeera, BBC Arabic, CNN Arabic, and RT Arabic, via RapidAPI. It uses a modern data engineering stack to load the processed data into Google BigQuery and applies LLMs for advanced content analysis.

Architecture

(System architecture diagram; see the repository for the image.)

Technologies

| Component | Technology Stack |
|---|---|
| Data Ingestion | Kafka Cluster (KRaft mode), RapidAPI |
| Data Processing | PySpark, Spark Cluster |
| NLP & AI | Llama 3.1-8B + custom function calling via GROQ API (topic classification, geolocation, entity recognition) |
| Data Warehouse | Google BigQuery |
| Orchestration | Apache Airflow |
| Monitoring | Prometheus, Grafana, Spark UI, Airflow UI |
| Infrastructure | Docker, Docker Compose |

Quick Start (Lightweight Stack)

We recommend the Lightweight Stack for development: it keeps resource usage to roughly 4GB of RAM, compared with about 7GB for the full production stack.

Prerequisites

  • Docker Desktop & Docker Compose
  • API keys: RapidAPI (X_RAPID_API_KEY) and GROQ (GROQ_API_KEY)
  • GCP Service Account key (for BigQuery)

1. Setup Environment

  1. Clone the repository:

    git clone https://github.com/achrafS133/Arabic_News_Data_Engineering_project.git
    cd Arabic_News_Data_Engineering_project
  2. Configure Environment Variables: Update .env (or .env.dev) with your credentials:

    AIRFLOW__SMTP__SMTP_HOST=your_smtp_host
    AIRFLOW__SMTP__SMTP_USER=your_smtp_user
    AIRFLOW__EMAIL__FROM_EMAIL=your_email
    X_RAPID_API_KEY=your_rapidapi_key
    GROQ_API_KEY=your_groq_key
  3. Google Cloud Auth:

    • Save your GCP Service Account key as keyfile.json in ./kafka/config/.
    • Ensure the service account has BigQuery permissions.
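Before starting the stack, it can help to confirm the required credentials are actually set. A minimal sketch (variable names taken from the `.env` example above; the demo values are inlined purely for illustration):

```shell
# Sketch: fail fast if a required credential is missing before `docker-compose up`.
# In practice these values come from .env; "demo" / "" are placeholders.
X_RAPID_API_KEY="demo" GROQ_API_KEY="" sh -c '
  for v in X_RAPID_API_KEY GROQ_API_KEY; do
    eval "val=\$$v"
    if [ -n "$val" ]; then echo "ok: $v"; else echo "missing: $v"; fi
  done
'
# prints "ok: X_RAPID_API_KEY" then "missing: GROQ_API_KEY"
```

Running the same loop against your real environment (without the placeholder assignments) shows at a glance which keys still need to be filled in.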

2. Start Services

Option 1: Windows (Automated Script)

.\scripts\migrate-to-lightweight.ps1

Option 2: Manual (Universal)

docker-compose -f compose-dev.yaml --profile dev up -d

Use --profile dev-monitor to include Prometheus & Grafana.

3. Access Services

| Service | URL | Credentials |
|---|---|---|
| Airflow UI | http://localhost:8086 | airflow_admin / airflow_admin |
| Spark Master | http://localhost:8084 | - |
| Spark Worker | http://localhost:8085 | - |
| Grafana | http://localhost:3001 | admin / admin (if enabled) |
| Prometheus | http://localhost:9090 | - (if enabled) |
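A quick reachability check can confirm the core services came up (ports taken from the table above). This sketch tolerates failure, so a service that is not yet running simply reports "down":

```shell
# Sketch: probe the core dev services; a non-listening port prints "down".
for entry in "Airflow UI:8086" "Spark Master:8084" "Spark Worker:8085"; do
  name=${entry%:*}
  port=${entry##*:}
  if curl -sf -o /dev/null "http://localhost:$port"; then
    echo "$name: up"
  else
    echo "$name: down"
  fi
done
```

Airflow in particular can take a minute or two to start; re-run the loop until everything reports "up" before triggering the DAG.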

Run the Pipeline

  1. Open Airflow UI at http://localhost:8086.
  2. Locate the DAG news_fetching_and_processing.
  3. Toggle the DAG On and click the Play button to trigger manually.
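The same trigger can be issued from the command line with the Airflow CLI inside the container. The service name `airflow-webserver` below is an assumption; check `compose-dev.yaml` for the actual service name. The snippet only echoes the command so it is safe to run anywhere; drop the `echo` to execute it against a running stack:

```shell
# Hypothetical CLI alternative to the UI trigger (service name is an assumption).
echo docker-compose -f compose-dev.yaml exec airflow-webserver \
  airflow dags trigger news_fetching_and_processing
```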

Profiles & Resource Usage

| Profile | Command Flag | Containers | RAM Usage | Description |
|---|---|---|---|---|
| Dev | --profile dev | 10 | ~4GB | Core ETL components only; best for local development. |
| Monitor | --profile dev-monitor | 13 | ~5GB | Adds Prometheus & Grafana. |
| Full | --profile all | 21 | ~7GB | Full production simulation (HA Kafka, etc.). |

To switch profiles (e.g., to full production stack):

docker-compose -f compose-dev.yaml --profile all up -d

Documentation

  • Lightweight Guide: Detailed breakdown of the optimized stack and resource limits.
  • Setup Guide: Comprehensive installation and troubleshooting instructions.
  • Quick Start: Step-by-step generic startup guide.

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
