Spark ML Analytics Platform

Big Data + ML Platform powered by Apache Spark for large-scale analytics and machine learning. Inspired by Trafilea's recommender system architecture.

🎯 Overview

This template provides a complete Spark-based ML platform for processing massive datasets and building production ML models at scale.

🔥 What makes this different?

🆚 vs Other ML Templates:

  • 📊 Big Data First - Designed for TBs of data, not GBs
  • ⚡ Spark Native - Leverages Spark's distributed computing power
  • 🔧 ETL + ML - Combined data engineering and machine learning
  • 🏗️ Production Ready - EMR, Glue, Athena integration
  • 💰 Cost Optimized - Spot instances, auto-scaling, efficient resource usage

🏗️ Architecture

📊 Data Sources (Athena/S3/Redshift)
    ↓
🔄 Spark ETL (EMR/Glue)
    ↓
🧪 Feature Engineering (Spark ML)
    ↓
🤖 Model Training (Spark MLlib/XGBoost)
    ↓
📈 Model Evaluation & Analytics
    ↓
💾 Model Registry (S3/MLflow)
    ↓
🚀 Batch Scoring & Insights

🛠️ AWS Services Stack

| Component     | AWS Services                       | Purpose                       |
|---------------|------------------------------------|-------------------------------|
| Compute       | EMR, Glue, EC2 Spot                | Distributed Spark processing  |
| Storage       | S3 (Data Lake), EFS                | Raw data, features, models    |
| Databases     | Athena, Redshift, Glue Catalog     | Data querying and metadata    |
| Orchestration | Step Functions, Airflow            | Workflow management           |
| Monitoring    | CloudWatch, EMR Notebooks          | Observability and debugging   |
| ML            | Spark MLlib, SageMaker (optional)  | Model training and inference  |

🚀 Quick Start

1. Infrastructure Setup

# Clone the repository
git clone https://github.com/trafilea/spark-ml-analytics.git
cd spark-ml-analytics

# Deploy infrastructure
make deploy-infra ENV=dev

# Check AWS configuration
make check-aws

2. Upload Sample Data

# Create sample dataset
mkdir -p data/raw
echo "user_id,product_id,rating,timestamp" > data/raw/interactions.csv
echo "user1,product1,4.5,2024-01-01" >> data/raw/interactions.csv

# Upload to S3
make upload-data ENV=dev

3. Run Spark Pipeline

# Run full ML pipeline
make pipeline ENV=dev

# Or run individual steps
make extract ENV=dev
make transform ENV=dev
make feature-eng ENV=dev
make train ENV=dev

📁 Project Structure

spark-ml-analytics/
├── src/
│   ├── main.py                    # CLI entry point
│   ├── controllers/               # Configuration and pipeline controllers
│   ├── etl/
│   │   ├── extraction/            # Data extraction from various sources
│   │   ├── transformation/        # Spark ETL processing
│   │   └── loading/               # Data loading utilities
│   ├── analytics/
│   │   ├── feature_engineering/   # ML feature generation
│   │   ├── model_training/        # Spark MLlib training
│   │   ├── evaluation/            # Model evaluation
│   │   └── insights/              # Business analytics
│   ├── connectors/                # AWS service connectors
│   │   ├── s3_connector.py
│   │   ├── athena_connector.py
│   │   ├── emr_connector.py
│   │   └── glue_connector.py
│   └── utils/                     # Spark utilities and helpers
├── infrastructure/                # Terraform infrastructure
│   └── terraform/
│       ├── modules/
│       │   ├── emr/               # EMR cluster configuration
│       │   ├── glue/              # Glue jobs and catalog
│       │   ├── s3/                # Data lake setup
│       │   └── athena/            # Athena workgroup
│       └── environments/
├── configs/                       # Environment configurations
│   ├── dev.yaml
│   ├── staging.yaml
│   └── prod.yaml
├── notebooks/                     # Jupyter/Zeppelin notebooks
├── scripts/                       # Utility scripts
├── tests/                         # Test suite
└── data/                          # Local data directory

⚙️ Configuration

Environment Configuration

Each environment has its own YAML configuration in configs/:

# configs/dev.yaml
environment: dev
aws:
  region: us-east-1
  s3_bucket: "trafilea-spark-ml-dev"

spark:
  app_name: "TrafileaSparkMLAnalytics-Dev"
  master: "local[*]"  # or "yarn" for EMR
  driver_memory: "2g"
  executor_memory: "2g"

cluster:
  type: "local"  # local, emr, glue
  emr:
    instance_type: "m5.xlarge"
    instance_count: 3

Spark Configuration

Optimized Spark settings for different workloads:

spark:
  config:
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    "spark.dynamicAllocation.enabled": "true"
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension"

🔧 Usage Examples

ETL Processing

# Extract data from Athena
python src/main.py extract \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --data-source athena

# Transform with Spark
python src/main.py transform \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --job-type etl \
    --cluster-mode emr

Feature Engineering

# Generate user features
python src/main.py feature-engineering \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --feature-type user

# Generate item features  
python src/main.py feature-engineering \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --feature-type item

Model Training

# Train collaborative filtering model
python src/main.py train \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --model-type collaborative_filtering \
    --framework spark_mllib

# Train XGBoost model
python src/main.py train \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --model-type xgboost \
    --framework spark_mllib

Analytics

# Run descriptive analytics
python src/main.py analytics \
    --env dev \
    --execution-datetime "2024-01-01 10:00:00" \
    --analysis-type descriptive

🤖 ML Algorithms Supported

Collaborative Filtering

  • Spark ALS (Alternating Least Squares), sketched below
  • Matrix Factorization
  • Implicit Feedback models
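
A minimal ALS training sketch with Spark MLlib, using the interactions schema from the Quick Start; column names and hyperparameters are illustrative:

# sketch: ALS collaborative filtering on (user_id, product_id, rating)
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

interactions = spark.read.csv("data/raw/interactions.csv", header=True, inferSchema=True)

# ALS expects numeric user/item ids, so index the string columns first
indexed = interactions
for col in ("user_id", "product_id"):
    indexed = StringIndexer(inputCol=col, outputCol=f"{col}_idx").fit(indexed).transform(indexed)

als = ALS(
    userCol="user_id_idx",
    itemCol="product_id_idx",
    ratingCol="rating",
    rank=32,
    regParam=0.1,
    coldStartStrategy="drop",
)
train, test = indexed.randomSplit([0.8, 0.2], seed=42)
model = als.fit(train)

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(model.transform(test))
recommendations = model.recommendForAllUsers(10)  # top-10 products per user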

Content-Based

  • Random Forest with item features
  • Gradient Boosting with user/item features
  • Feature-based similarity models

Ensemble Methods

  • XGBoost (via Spark ML), sketched below
  • LightGBM integration
  • Model Stacking
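
One way to run distributed XGBoost from PySpark is the xgboost.spark estimators shipped with xgboost >= 1.7; whether this template uses that or XGBoost4J-Spark is not specified, so treat this as a sketch (training_df and the feature columns are assumptions):

# sketch: distributed XGBoost with the xgboost.spark estimator (xgboost >= 1.7)
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor

feature_cols = ["user_feature_1", "item_feature_1", "price"]  # hypothetical features
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(training_df)

xgb = SparkXGBRegressor(
    features_col="features",
    label_col="rating",
    num_workers=4,     # number of parallel Spark tasks used for training
    max_depth=6,
    n_estimators=200,
)
model = xgb.fit(assembled)
predictions = model.transform(assembled)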

Time Series

  • Trend Analysis (moving-average sketch below)
  • Seasonality Detection
  • Forecasting Models
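
As a starting point for trend analysis, a rolling average over a daily aggregate is often enough; a sketch with Spark window functions (daily_sales and its columns are assumptions):

# sketch: 7-day moving average as a simple trend signal
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("date").rowsBetween(-6, 0)   # current day plus 6 preceding days
trend = daily_sales.withColumn("sales_ma7", F.avg("sales").over(w))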

📊 Analytics Capabilities

User Analytics

  • RFM Analysis (Recency, Frequency, Monetary), sketched below
  • User Segmentation
  • Behavioral Patterns
  • Churn Prediction
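
RFM reduces to a groupBy over the interactions table (the same DataFrame loaded in the ALS sketch above); the reference date, timestamp parsing, and the order_value column are assumptions, since the sample data only carries ratings:

# sketch: RFM metrics per user, scored into quartiles
from pyspark.sql import functions as F
from pyspark.sql.window import Window

reference_date = F.to_date(F.lit("2024-01-31"))

rfm = (
    interactions
    .groupBy("user_id")
    .agg(
        F.datediff(reference_date, F.max(F.to_date("timestamp"))).alias("recency_days"),
        F.count("*").alias("frequency"),
        F.sum("order_value").alias("monetary"),   # hypothetical column
    )
)

# quartile scores per dimension (global windows are fine at user-level cardinality)
rfm = (
    rfm
    .withColumn("r_score", F.ntile(4).over(Window.orderBy(F.asc("recency_days"))))
    .withColumn("f_score", F.ntile(4).over(Window.orderBy(F.desc("frequency"))))
    .withColumn("m_score", F.ntile(4).over(Window.orderBy(F.desc("monetary"))))
)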

Product Analytics

  • Product Performance
  • Cross-sell Analysis
  • Inventory Optimization
  • Price Elasticity

Business Intelligence

  • Revenue Analytics
  • Conversion Funnels
  • A/B Test Analysis
  • Market Basket Analysis (FP-Growth sketch below)
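
Market basket analysis maps directly onto Spark MLlib's FP-Growth; a sketch (the orders DataFrame and its columns are assumptions):

# sketch: frequent itemsets and cross-sell rules with FP-Growth
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import functions as F

baskets = orders.groupBy("order_id").agg(
    F.collect_set("product_id").alias("items"))

fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
model = fp.fit(baskets)

model.freqItemsets.show()        # frequently co-purchased item sets
model.associationRules.show()    # rules of the form antecedent -> consequent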

🚀 Deployment

Local Development

# Run a pipeline step on local Spark (master: local[*] from configs/dev.yaml)
make transform ENV=dev

# Run Jupyter notebooks
make start-jupyter

EMR Cluster

# Deploy EMR cluster
make start-emr ENV=prod

# Submit Spark job
make submit-spark ENV=prod

# Monitor via Spark UI
make spark-ui ENV=prod

Glue Jobs

# Deploy Glue infrastructure
make deploy-infra ENV=prod

# Run Glue ETL job
aws glue start-job-run --job-name spark-ml-analytics-prod

📈 Monitoring & Observability

EMR Monitoring

  • Spark History Server - Job execution history
  • YARN ResourceManager - Cluster resource usage
  • CloudWatch Metrics - System and application metrics

Data Quality

  • Data Validation checks
  • Schema Evolution tracking
  • Data Lineage documentation

Performance Optimization

  • Query Performance insights
  • Resource Utilization monitoring
  • Cost Optimization recommendations

🔗 Integration Patterns

Airflow Integration

# DAG for the daily ML pipeline (EMR on EKS); the virtual cluster id,
# execution role, and release label below are placeholders
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr_containers import EMRContainerOperator

with DAG(
    'spark_ml_pipeline',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = EMRContainerOperator(
        task_id='extract_data',
        name='extract-data',
        virtual_cluster_id='<emr-virtual-cluster-id>',      # placeholder
        execution_role_arn='<emr-job-execution-role-arn>',  # placeholder
        release_label='emr-6.15.0-latest',
        job_driver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://bucket/src/main.py",
                "entryPointArguments": ["extract", "--env", "prod"]
            }
        },
    )

Real-time Integration

# Stream processing with Kinesis (the "kinesis" source requires a connector,
# e.g. spark-sql-kinesis on open-source Spark; option names vary by connector)
def process_batch(batch_df, batch_id):
    # hypothetical sink: append each micro-batch to the data lake
    batch_df.write.mode("append").parquet("s3://bucket/streaming/user-events/")

(spark.readStream
    .format("kinesis")
    .option("streamName", "user-events")
    .load()
    .writeStream
    .option("checkpointLocation", "s3://bucket/checkpoints/user-events/")
    .foreachBatch(process_batch)
    .start())

💡 Use Cases

🛒 E-commerce Recommendations

  • Product Recommendations at scale
  • Cross-sell & Upsell optimization
  • Personalized Marketing campaigns

📊 Financial Analytics

  • Fraud Detection models
  • Risk Assessment analytics
  • Customer Lifetime Value modeling

🏭 IoT & Manufacturing

  • Predictive Maintenance
  • Quality Control analytics
  • Supply Chain optimization

📱 Digital Marketing

  • Customer Segmentation
  • Attribution Modeling
  • Marketing Mix optimization

🧪 Testing

# Run all tests
make test

# Spark-specific tests
pytest tests/spark/ -v

# Integration tests with EMR
pytest tests/integration/ -v
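
Spark tests typically share one local session; a minimal conftest.py fixture sketch (the fixture name and setup are assumptions about this repo's test suite):

# sketch: session-scoped SparkSession fixture for the test suite
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("spark-ml-analytics-tests")
        .getOrCreate()
    )
    yield session
    session.stop()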

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

📝 License

Copyright © 2024 Trafilea

🆘 Support


🎉 Ready to process TBs of data with Spark!

🔄 When to use this template:

✅ Processing > 100GB datasets
✅ Complex feature engineering
✅ Distributed ML training
✅ Real-time + Batch analytics
✅ Cost-sensitive big data workloads
