Spark ETL Framework

Author: Amit Prakash Nema

A generic, multimodule framework for building scalable ETL (Extract, Transform, Load) processes using Apache Spark and Java. This framework is designed to be cloud-agnostic, maintainable, testable, and production-ready.

Architecture Overview

The framework follows enterprise design patterns and consists of the following modules:

  • etl-core: Core framework components including configuration, validation, transformation interfaces, and I/O operations
  • etl-jobs: Concrete job implementations and custom transformers

Architecture Diagram

flowchart TD
    A[Job Config] --> B[ETLEngine]
    B --> C[Reader]
    B --> D[Transformer]
    B --> E[Writer]
    B --> F[Validation]
    B --> G[Monitoring]
    B --> H[Cloud Integration]
    C -->|Reads Data| D
    D -->|Transforms Data| E
    E -->|Writes Data| F
    F -->|Validates| G
    G -->|Monitors| H
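
The flow in the diagram reduces to a read → transform → validate → write sequence driven by the job configuration. For reference, the sketch below expresses that sequence with plain Spark APIs only; the paths and column names are placeholders, and the framework's actual ETLEngine, reader, and writer classes wrap these steps behind the interfaces described later.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class PipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate();

        // Extract: read the raw input (placeholder path)
        Dataset<Row> raw = spark.read().parquet("/data/input");

        // Transform: apply business logic (placeholder column)
        Dataset<Row> transformed = raw.filter(col("status").equalTo("active"));

        // Validate: a simple NOT_NULL-style check before writing
        long nullIds = transformed.filter(col("id").isNull()).count();
        if (nullIds > 0) {
            throw new IllegalStateException(nullIds + " rows failed the NOT_NULL check on 'id'");
        }

        // Load: write the result (placeholder path)
        transformed.write().mode("overwrite").parquet("/data/output");

        spark.stop();
    }
}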

Key Features

  • Cloud Agnostic: Supports AWS S3, Google Cloud Storage, Azure Blob Storage
  • Modular Design: Separation of concerns with pluggable components
  • Configuration-Driven: YAML-based job configuration with parameter substitution
  • Data Validation: Built-in validation rules (NOT_NULL, UNIQUE, RANGE, REGEX)
  • Multiple Data Sources: File systems, databases (JDBC), streaming sources
  • Format Support: Parquet, JSON, CSV, Avro, ORC
  • Transformation Framework: Abstract transformer pattern with custom implementations
  • Error Handling: Comprehensive exception handling and logging
  • Production Ready: Docker containerization, monitoring, and deployment scripts
  • Testing Support: Unit and integration testing capabilities

Project Structure

spark-etl-framework/
├── etl-core/                    # Core framework module
│   └── src/main/java/com/etl/core/
│       ├── config/              # Configuration management
│       ├── model/               # Data models and POJOs
│       ├── io/                  # Data readers and writers
│       ├── validation/          # Data validation components
│       ├── transformation/      # Transformation interfaces
│       ├── factory/             # Factory pattern implementations
│       ├── exception/           # Custom exceptions
│       └── utils/               # Utility classes
├── etl-jobs/                    # Job implementations
│   └── src/main/java/com/etl/jobs/
│       └── sample/              # Sample job implementation
├── dist/                        # Distribution packages
│   └── sample-job/              # Ready-to-deploy job package
├── docker/                      # Docker configuration
├── scripts/                     # Build and deployment scripts
└── README.md                    # This file

Design Patterns Used

  1. Factory Pattern: For creating readers, writers, and transformers (see the sketch below)
  2. Template Method Pattern: For the transformation pipeline
  3. Strategy Pattern: For different I/O implementations
  4. Builder Pattern: For configuration objects
  5. Singleton Pattern: For configuration management
  6. Command Pattern: For job execution
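
To make the first pattern concrete, the sketch below selects a reader implementation by source format. The interface and class names are illustrative only; the framework's actual factory lives in etl-core's factory package and works with its own DataReader and InputConfig types.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Illustrative sketch only, not the framework's actual API.
interface SketchReader {
    Dataset<Row> read(SparkSession spark, String path);
}

class SketchReaderFactory {
    static SketchReader forFormat(String format) {
        switch (format.toLowerCase()) {
            case "parquet": return (spark, path) -> spark.read().parquet(path);
            case "csv":     return (spark, path) -> spark.read().option("header", "true").csv(path);
            case "json":    return (spark, path) -> spark.read().json(path);
            default:        throw new IllegalArgumentException("Unsupported format: " + format);
        }
    }
}

Each returned implementation is effectively a strategy chosen at runtime, which is how the Factory and Strategy patterns work together in the I/O layer.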

Getting Started

Prerequisites

  • Java 11 (or higher)
  • Maven 3.6+
  • Docker (for containerized deployment)

Build & Test

mvn clean verify

This runs all unit and integration tests, plus code quality checks, coverage reporting, and security scans.

Configuration

All configuration values can be overridden via environment variables. Example:

export DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db"
export AWS_ACCESS_KEY="your-access-key"
export AWS_SECRET_KEY="your-secret-key"

See src/main/resources/application.properties and etl-jobs/src/main/resources/jobs/sample-job/job-config.yaml for all available variables.

Running Locally

java -jar etl-jobs/target/etl-jobs-*.jar

Running with Docker

Build and run the container:

docker build -t etl-framework:latest -f docker/Dockerfile .
docker run --rm -e DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db" etl-framework:latest

CI/CD Workflows

  • Build, Test, Quality, Security: See .github/workflows/maven.yml
  • Docker Build & Publish: See .github/workflows/docker.yml

Cloud Agnostic Deployment

  • Supports AWS, GCP, Azure, and local deployments via environment variables.
  • Containerized for Kubernetes, Docker Compose, or serverless platforms.

Development Guide

Creating Custom Transformers

Extend the AbstractDataTransformer class:

import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class MyCustomTransformer extends AbstractDataTransformer {
    @Override
    protected Dataset<Row> doTransform(Dataset<Row> input, Map<String, Object> parameters) {
        // Your transformation logic here, e.g. keep only rows whose status is "active"
        return input.filter(functions.col("status").equalTo("active"));
    }
}
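
For context, the doTransform hook implies a template-method base class roughly along these lines; this is a hedged sketch, and the framework's actual AbstractDataTransformer may add validation, logging, or metric hooks beyond what is shown.

import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative sketch of the template-method shape; not the framework's actual code.
public abstract class SketchTransformer {
    // Template method: a fixed pipeline around the customizable doTransform step
    public final Dataset<Row> transform(Dataset<Row> input, Map<String, Object> parameters) {
        if (input == null) {
            throw new IllegalArgumentException("Input dataset must not be null");
        }
        return doTransform(input, parameters);
    }

    protected abstract Dataset<Row> doTransform(Dataset<Row> input, Map<String, Object> parameters);
}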

Adding New Data Sources

Implement the DataReader interface:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MyCustomReader implements DataReader {
    @Override
    public Dataset<Row> read(InputConfig config) {
        // Your data reading logic here; in a real implementation the format
        // and path would come from the InputConfig rather than being hard-coded
        SparkSession spark = SparkSession.builder().getOrCreate();
        return spark.read().parquet("/path/to/source");
    }
}

Configuration Management

The framework uses a hierarchical configuration approach:

  1. application.properties - Application-wide settings
  2. job-config.yaml - Job-specific configuration
  3. Command-line parameters - Runtime overrides
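
In effect, later sources override earlier ones. The sketch below shows that precedence as a plain map merge; the property names and loading mechanics are illustrative, not the framework's actual implementation.

import java.util.HashMap;
import java.util.Map;

// Illustrative precedence: application.properties < job-config.yaml < command-line parameters
class SketchConfigResolver {
    static Map<String, String> resolve(Map<String, String> appDefaults,
                                       Map<String, String> jobConfig,
                                       Map<String, String> cliOverrides) {
        Map<String, String> merged = new HashMap<>(appDefaults);
        merged.putAll(jobConfig);      // job-specific values win over application defaults
        merged.putAll(cliOverrides);   // runtime parameters win over everything
        return merged;
    }
}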

Cloud Deployment

AWS Configuration

cloud.provider=aws
aws.access.key=YOUR_ACCESS_KEY
aws.secret.key=YOUR_SECRET_KEY
aws.region=us-east-1

GCP Configuration

cloud.provider=gcp
gcp.project.id=your-project-id
gcp.service.account.key=/path/to/service-account.json

Azure Configuration

cloud.provider=azure
azure.storage.account=your-storage-account
azure.storage.access.key=your-access-key

Docker Deployment

# Build Docker image
cd docker/
docker-compose build

# Run with Docker Compose
docker-compose up -d

# Execute ETL job in container
docker-compose exec etl-framework ./sample-job/spark-submit.sh

Monitoring and Logging

The framework includes comprehensive logging using Logback:

  • Console Output: Real-time job progress
  • File Logging: Detailed logs with rotation
  • Structured Logging: JSON format for log aggregation
  • Metrics: Job execution metrics and validation results
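
Application code typically logs through the standard SLF4J API, which Logback implements. A minimal example with placeholder job and metric values:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class LoggingSketch {
    private static final Logger log = LoggerFactory.getLogger(LoggingSketch.class);

    public static void main(String[] args) {
        // MDC entries appear as fields in structured (JSON) log output
        MDC.put("jobId", "sample-job");
        log.info("Job started");
        log.info("Validation finished: {} rows checked, {} failures", 10_000, 0);
        MDC.clear();
    }
}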

Testing

# Run unit tests
mvn test

# Run integration tests
mvn verify -P integration-tests

Performance Tuning

Spark Configuration

Key configuration parameters in application.properties:

spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
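
The same settings can also be applied programmatically when the SparkSession is built, or passed as --conf flags to spark-submit; whichever way the framework wires them in, the effect is the same. A standalone sketch:

import org.apache.spark.sql.SparkSession;

public class TunedSessionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("etl-job")
                .config("spark.sql.adaptive.enabled", "true")
                .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
                .config("spark.sql.adaptive.skewJoin.enabled", "true")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // ... run the job ...

        spark.stop();
    }
}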

Memory Settings

Adjust memory settings in the spark-submit script:

--driver-memory 4g
--executor-memory 8g
--executor-cores 4

Contributing

  • Fork the repo and create a feature branch.
  • Run mvn clean verify before submitting a PR.
  • Ensure new code is covered by tests and passes code quality checks.

License

Apache 2.0

Contact

For questions or support, open an issue or contact the maintainer.

Roadmap

  • Kafka streaming support
  • Delta Lake integration
  • Advanced data profiling
  • ML pipeline integration
  • REST API for job management
  • Web UI for monitoring

Built with ❤️ using Apache Spark and Java
