Author: ACHRAF ER-RAHOUTI
This project implements a scalable ETL pipeline to extract and process Arabic news content from over 9 sources including Al Jazeera, BBC Arabic, CNN Arabic, and RT Arabic via RapidAPI. It leverages a modern data engineering stack to load processed data into Google BigQuery, utilizing LLMs for advanced content analysis.
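As a rough illustration of the ingestion path described above, the sketch below shows how an article fetched over RapidAPI might be serialized for a Kafka producer. The RapidAPI host value and the article field names here are hypothetical placeholders, not taken from this repo's actual producer code:

```python
import json

def build_rapidapi_headers(api_key: str, host: str = "arabic-news.p.rapidapi.com") -> dict:
    """Headers RapidAPI expects on every request (host value is illustrative)."""
    return {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": host}

def to_kafka_message(article: dict) -> bytes:
    """Serialize one article into a JSON payload a Kafka producer could send."""
    payload = {
        "source": article.get("source"),
        "title": article.get("title"),
        "body": article.get("body"),
        "published_at": article.get("published_at"),
    }
    # ensure_ascii=False keeps the Arabic text readable in the topic
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")

headers = build_rapidapi_headers("your_rapidapi_key")
msg = to_kafka_message({"source": "BBC Arabic", "title": "عنوان", "body": "نص الخبر", "published_at": "2024-01-01"})
```

In the real pipeline, `msg` would be handed to a Kafka producer's `send()` targeting the cluster defined in the Compose files.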
| Component | Technology Stack |
|---|---|
| Data Ingestion | Kafka Cluster (KRaft mode), RapidAPI |
| Data Processing | PySpark, Spark Cluster |
| NLP & AI | Llama 3.1-8B + Custom Function Calling via GROQ API (Topic Classification, Geolocation, Entity Recognition) |
| Data Warehouse | Google BigQuery |
| Orchestration | Apache Airflow |
| Monitoring | Prometheus, Grafana, Spark UI, Airflow UI |
| Infrastructure | Docker, Docker Compose |
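The NLP layer relies on function calling, so the LLM returns structured fields rather than free text. Below is a sketch of an OpenAI-style tools schema, as accepted by GROQ's OpenAI-compatible chat endpoint, for the three tasks listed above (topic classification, geolocation, entity recognition). The function name, field names, and enum values are illustrative, not the ones this repo actually registers:

```python
# Illustrative tool schema for structured extraction from an Arabic article.
ANALYZE_ARTICLE_TOOL = {
    "type": "function",
    "function": {
        "name": "analyze_article",
        "description": "Classify topic, geolocate, and extract entities from an Arabic news article.",
        "parameters": {
            "type": "object",
            "properties": {
                "topic": {"type": "string", "enum": ["politics", "economy", "sports", "culture", "other"]},
                "country": {"type": "string", "description": "ISO 3166-1 alpha-2 code of the main location"},
                "entities": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["topic", "country", "entities"],
        },
    },
}

def build_chat_request(article_text: str, model: str = "llama-3.1-8b-instant") -> dict:
    """Request body for an OpenAI-compatible chat completions call, forcing the tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": article_text}],
        "tools": [ANALYZE_ARTICLE_TOOL],
        "tool_choice": {"type": "function", "function": {"name": "analyze_article"}},
    }
```

Forcing `tool_choice` makes the model's answer arrive as parsed JSON arguments, which keeps the downstream PySpark stage free of brittle text parsing.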
We recommend using the Lightweight Stack for development, which optimizes resource usage (RAM ~4GB) compared to the full production stack.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/achrafS133/Arabic_News_Data_Engineering_project.git
   cd Arabic_News_Data_Engineering_project
   ```

2. **Configure Environment Variables:** Update `.env` (or `.env.dev`) with your credentials:

   ```env
   AIRFLOW__SMTP__SMTP_HOST=your_smtp_host
   AIRFLOW__SMTP__SMTP_USER=your_smtp_user
   AIRFLOW__EMAIL__FROM_EMAIL=your_email
   X_RAPID_API_KEY=your_rapidapi_key
   GROQ_API_KEY=your_groq_key
   ```

3. **Google Cloud Auth:**
   - Save your GCP Service Account key as `keyfile.json` in `./kafka/config/`.
   - Ensure the service account has BigQuery permissions.
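A quick sanity check on the configuration can save a failed container boot. This is a minimal stdlib-only sketch (the variable names match the `.env` snippet above, and the keyfile path matches the Google Cloud Auth step; it is not part of the repo's own tooling):

```python
import json
import os

REQUIRED_ENV = [
    "AIRFLOW__SMTP__SMTP_HOST",
    "AIRFLOW__SMTP__SMTP_USER",
    "AIRFLOW__EMAIL__FROM_EMAIL",
    "X_RAPID_API_KEY",
    "GROQ_API_KEY",
]

def missing_env(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not environ.get(name)]

def keyfile_looks_valid(path: str = "./kafka/config/keyfile.json") -> bool:
    """Cheap structural check that the GCP service-account key is in place."""
    if not os.path.exists(path):
        return False
    with open(path) as fh:
        key = json.load(fh)
    return key.get("type") == "service_account" and "private_key" in key
```

Running `missing_env()` before `docker-compose up` surfaces typos in `.env` immediately instead of as SMTP or API errors deep inside a DAG run.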
**Option 1: Windows (Automated Script)**

```powershell
.\scripts\migrate-to-lightweight.ps1
```

**Option 2: Manual (Universal)**

```bash
docker-compose -f compose-dev.yaml --profile dev up -d
```

Use `--profile dev-monitor` to include Prometheus & Grafana.
| Service | URL | Credentials |
|---|---|---|
| Airflow UI | http://localhost:8086 | airflow_admin / airflow_admin |
| Spark Master | http://localhost:8084 | - |
| Spark Worker | http://localhost:8085 | - |
| Grafana | http://localhost:3001 | admin / admin (if enabled) |
| Prometheus | http://localhost:9090 | - (if enabled) |
- Open Airflow UI at http://localhost:8086.
- Locate the DAG `news_fetching_and_processing`.
- Toggle the DAG On and click the Play button to trigger it manually.
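The DAG can also be triggered without the UI, via Airflow's stable REST API. The sketch below builds that request using only the stdlib; it assumes the API is exposed with HTTP basic auth using the default credentials from the services table (this depends on the image's configured auth backend, so treat it as an assumption, not repo-verified behavior):

```python
import base64
import json
import urllib.request

AIRFLOW_BASE = "http://localhost:8086/api/v1"
DAG_ID = "news_fetching_and_processing"

def trigger_request(base: str = AIRFLOW_BASE, dag_id: str = DAG_ID) -> urllib.request.Request:
    """Build the POST that asks Airflow's stable REST API for a new DAG run."""
    # Basic auth with the default airflow_admin credentials (assumption).
    token = base64.b64encode(b"airflow_admin:airflow_admin").decode()
    return urllib.request.Request(
        f"{base}/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

# urllib.request.urlopen(trigger_request())  # only works against a live Airflow
```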
| Profile | Command Flag | Containers | RAM Usage | Description |
|---|---|---|---|---|
| Dev | `--profile dev` | 10 | ~4GB | Core ETL components only. Best for local dev. |
| Monitor | `--profile dev-monitor` | 13 | ~5GB | Adds Prometheus & Grafana. |
| Full | `--profile all` | 21 | ~7GB | Full production simulation (HA Kafka, etc.). |
To switch profiles (e.g., to full production stack):
```bash
docker-compose -f compose-dev.yaml --profile all up -d
```

- Lightweight Guide: Detailed breakdown of the optimized stack and resource limits.
- Setup Guide: Comprehensive installation and troubleshooting instructions.
- Quick Start: Step-by-step generic startup guide.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
