Author: Amit Prakash Nema
A generic, multimodule framework for building scalable ETL (Extract, Transform, Load) processes using Apache Spark and Java. This framework is designed to be cloud-agnostic, maintainable, testable, and production-ready.
The framework follows enterprise design patterns and consists of the following modules:
- etl-core: Core framework components including configuration, validation, transformation interfaces, and I/O operations
- etl-jobs: Concrete job implementations and custom transformers
```mermaid
flowchart TD
    A[Job Config] --> B[ETLEngine]
    B --> C[Reader]
    B --> D[Transformer]
    B --> E[Writer]
    B --> F[Validation]
    B --> G[Monitoring]
    B --> H[Cloud Integration]
    C -->|Reads Data| D
    D -->|Transforms Data| E
    E -->|Writes Data| F
    F -->|Validates| G
    G -->|Monitors| H
```
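At runtime the engine wires these components together. The sketch below illustrates that orchestration; `DataWriter`, the factory classes, and the `JobConfig` accessors are assumed names for illustration, not necessarily the framework's exact API:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative only: how the engine might wire reader, transformer, and
// writer together. DataWriter, the factories, and the JobConfig accessors
// are assumed names, not the framework's confirmed API.
public class ETLEngine {

    public void run(JobConfig config) {
        DataReader reader = ReaderFactory.create(config.getInput());
        AbstractDataTransformer transformer = TransformerFactory.create(config.getTransformation());
        DataWriter writer = WriterFactory.create(config.getOutput());

        Dataset<Row> raw = reader.read(config.getInput());                             // Extract
        Dataset<Row> transformed = transformer.transform(raw, config.getParameters()); // Transform
        writer.write(transformed, config.getOutput());                                 // Load
    }
}
```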
- ✅ Cloud Agnostic: Supports AWS S3, Google Cloud Storage, Azure Blob Storage
- ✅ Modular Design: Separation of concerns with pluggable components
- ✅ Configuration-Driven: YAML-based job configuration with parameter substitution
- ✅ Data Validation: Built-in validation rules (NOT_NULL, UNIQUE, RANGE, REGEX), illustrated in the sketch after this list
- ✅ Multiple Data Sources: File systems, databases (JDBC), streaming sources
- ✅ Format Support: Parquet, JSON, CSV, Avro, ORC
- ✅ Transformation Framework: Abstract transformer pattern with custom implementations
- ✅ Error Handling: Comprehensive exception handling and logging
- ✅ Production Ready: Docker containerization, monitoring, and deployment scripts
- ✅ Testing Support: Unit and integration testing capabilities
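For example, the NOT_NULL and RANGE rules could be evaluated with plain Spark column expressions; the helper below is an illustrative sketch, not the framework's actual rule classes:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Illustrative only: how NOT_NULL and RANGE rules could be evaluated with
// plain Spark. The framework's actual rule classes may differ.
public class ValidationSketch {

    // NOT_NULL: count rows where the column is null
    static long notNullViolations(Dataset<Row> data, String column) {
        return data.filter(col(column).isNull()).count();
    }

    // RANGE: count rows where the column falls outside [min, max]
    static long rangeViolations(Dataset<Row> data, String column, double min, double max) {
        return data.filter(col(column).lt(min).or(col(column).gt(max))).count();
    }
}
```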
```
spark-etl-framework/
├── etl-core/                  # Core framework module
│   └── src/main/java/com/etl/core/
│       ├── config/            # Configuration management
│       ├── model/             # Data models and POJOs
│       ├── io/                # Data readers and writers
│       ├── validation/        # Data validation components
│       ├── transformation/    # Transformation interfaces
│       ├── factory/           # Factory pattern implementations
│       ├── exception/         # Custom exceptions
│       └── utils/             # Utility classes
├── etl-jobs/                  # Job implementations
│   └── src/main/java/com/etl/jobs/
│       └── sample/            # Sample job implementation
├── dist/                      # Distribution packages
│   └── sample-job/            # Ready-to-deploy job package
├── docker/                    # Docker configuration
├── scripts/                   # Build and deployment scripts
└── README.md                  # This file
```
- Factory Pattern: For creating readers, writers, and transformers (see the sketch after this list)
- Abstract Template Pattern: For transformation pipeline
- Strategy Pattern: For different I/O implementations
- Builder Pattern: For configuration objects
- Singleton Pattern: For configuration management
- Command Pattern: For job execution
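For instance, the factory pattern for readers might look like the sketch below; `InputConfig.getType()` and the concrete reader classes are hypothetical names, not the framework's confirmed API:

```java
// Illustrative only: a factory that selects a DataReader implementation by
// source type. InputConfig.getType() and the concrete reader classes are
// hypothetical names for illustration.
public class ReaderFactory {

    public static DataReader create(InputConfig config) {
        switch (config.getType().toLowerCase()) {
            case "file":
                return new FileDataReader();
            case "jdbc":
                return new JdbcDataReader();
            default:
                throw new IllegalArgumentException(
                        "Unsupported source type: " + config.getType());
        }
    }
}
```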
- Java 11 (or higher)
- Maven 3.6+
- Docker (for containerized deployment)
```bash
mvn clean verify
```
This runs all unit and integration tests, code quality checks, coverage analysis, and security scans.
All configuration values can be overridden via environment variables. Example:
```bash
export DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db"
export AWS_ACCESS_KEY="your-access-key"
export AWS_SECRET_KEY="your-secret-key"
```

See `src/main/resources/application.properties` and `etl-jobs/src/main/resources/jobs/sample-job/job-config.yaml` for all available variables.
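One common way to support such overrides is to consult the environment before falling back to packaged defaults. The helper below is a hypothetical sketch of that lookup; the key-to-variable mapping is assumed, not confirmed:

```java
import java.util.Properties;

// Illustrative only: resolve a property, letting an environment variable
// (e.g. DATABASE_URL) override the packaged default (e.g. database.url).
public class ConfigResolver {

    private final Properties defaults;

    public ConfigResolver(Properties defaults) {
        this.defaults = defaults;
    }

    public String get(String key) {
        // Map "database.url" -> "DATABASE_URL" and prefer the env value if set
        String envKey = key.toUpperCase().replace('.', '_');
        String envValue = System.getenv(envKey);
        return envValue != null ? envValue : defaults.getProperty(key);
    }
}
```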
Run the job locally:

```bash
java -jar etl-jobs/target/etl-jobs-*.jar
```
Build and run the container:
```bash
docker build -t etl-framework:latest -f docker/Dockerfile .
docker run --rm -e DATABASE_URL="jdbc:postgresql://prod-db:5432/etl_db" etl-framework:latest
```
- Build, Test, Quality, Security: see `.github/workflows/maven.yml`
- Docker Build & Publish: see `.github/workflows/docker.yml`
- Supports AWS, GCP, Azure, and local deployments via environment variables.
- Containerized for Kubernetes, Docker Compose, or serverless platforms.
Extend the `AbstractDataTransformer` class:
```java
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class MyCustomTransformer extends AbstractDataTransformer {

    @Override
    protected Dataset<Row> doTransform(Dataset<Row> input, Map<String, Object> parameters) {
        // Keep only rows whose "status" column equals "active"
        return input.filter(functions.col("status").equalTo("active"));
    }
}
```
Implement the `DataReader` interface:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MyCustomReader implements DataReader {

    @Override
    public Dataset<Row> read(InputConfig config) {
        // Your data reading logic here, e.g. loading Parquet from the configured
        // path (config.getPath() is illustrative; use your InputConfig's accessor)
        SparkSession spark = SparkSession.builder().getOrCreate();
        return spark.read().parquet(config.getPath());
    }
}
```
The framework uses a hierarchical configuration approach:

1. `application.properties` - Application-wide settings
2. `job-config.yaml` - Job-specific configuration
3. Command-line parameters - Runtime overrides
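Precedence could be implemented as a simple layered lookup; the class below is a hypothetical sketch of that resolution order, not the framework's confirmed implementation:

```java
import java.util.Map;
import java.util.Properties;

// Illustrative only: later layers override earlier ones
// (application.properties < job-config.yaml < command-line parameters).
public class LayeredConfig {

    private final Properties appDefaults;        // application.properties
    private final Map<String, Object> jobConfig; // parsed job-config.yaml
    private final Map<String, String> cliArgs;   // command-line overrides

    public LayeredConfig(Properties appDefaults,
                         Map<String, Object> jobConfig,
                         Map<String, String> cliArgs) {
        this.appDefaults = appDefaults;
        this.jobConfig = jobConfig;
        this.cliArgs = cliArgs;
    }

    public String get(String key) {
        if (cliArgs.containsKey(key)) return cliArgs.get(key);   // highest precedence
        if (jobConfig.containsKey(key)) return String.valueOf(jobConfig.get(key));
        return appDefaults.getProperty(key);                     // lowest precedence
    }
}
```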
```properties
# AWS
cloud.provider=aws
aws.access.key=YOUR_ACCESS_KEY
aws.secret.key=YOUR_SECRET_KEY
aws.region=us-east-1
```

```properties
# Google Cloud
cloud.provider=gcp
gcp.project.id=your-project-id
gcp.service.account.key=/path/to/service-account.json
```

```properties
# Azure
cloud.provider=azure
azure.storage.account=your-storage-account
azure.storage.access.key=your-access-key
```
```bash
# Build Docker image
cd docker/
docker-compose build

# Run with Docker Compose
docker-compose up -d

# Execute ETL job in container
docker-compose exec etl-framework ./sample-job/spark-submit.sh
```
The framework includes comprehensive logging using Logback:
- Console Output: Real-time job progress
- File Logging: Detailed logs with rotation
- Structured Logging: JSON format for log aggregation
- Metrics: Job execution metrics and validation results
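As an illustration, job metrics could be pushed through SLF4J's MDC so a JSON Logback encoder emits them as structured fields; the class and field names below are assumptions, not the framework's confirmed API:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustrative only: emit job metrics through SLF4J so a JSON Logback
// encoder can pick up the MDC fields for log aggregation.
public class MetricsLogging {

    private static final Logger log = LoggerFactory.getLogger(MetricsLogging.class);

    static void logJobMetrics(String jobName, long rowsRead, long rowsWritten, long failedValidations) {
        MDC.put("job", jobName);
        MDC.put("rowsRead", Long.toString(rowsRead));
        MDC.put("rowsWritten", Long.toString(rowsWritten));
        MDC.put("failedValidations", Long.toString(failedValidations));
        try {
            log.info("ETL job finished");
        } finally {
            MDC.clear(); // don't leak fields into unrelated log lines
        }
    }
}
```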
```bash
# Run unit tests
mvn test

# Run integration tests
mvn verify -P integration-tests
```
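A transformer such as the `MyCustomTransformer` shown earlier can be unit tested against a local SparkSession. The JUnit 5 test below is a sketch; it assumes `AbstractDataTransformer` exposes a public `transform(...)` entry point:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

// Illustrative only: unit test a transformer against a local SparkSession.
class MyCustomTransformerTest {

    @Test
    void keepsOnlyActiveRows() {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("transformer-test")
                .getOrCreate();

        StructType schema = DataTypes.createStructType(Collections.singletonList(
                DataTypes.createStructField("status", DataTypes.StringType, false)));
        List<Row> rows = Arrays.asList(RowFactory.create("active"), RowFactory.create("inactive"));
        Dataset<Row> input = spark.createDataFrame(rows, schema);

        // Assumes a public transform(...) entry point on AbstractDataTransformer
        Dataset<Row> result = new MyCustomTransformer()
                .transform(input, Collections.emptyMap());

        assertEquals(1, result.count());
        spark.stop();
    }
}
```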
Key configuration parameters in `application.properties`:
```properties
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
```
Adjust memory settings in the spark-submit script:

```bash
--driver-memory 4g
--executor-memory 8g
--executor-cores 4
```
- Fork the repo and create a feature branch.
- Run `mvn clean verify` before submitting a PR.
- Ensure new code is covered by tests and passes code quality checks.
Apache 2.0
For questions or support, open an issue or contact the maintainer.
- Kafka streaming support
- Delta Lake integration
- Advanced data profiling
- ML pipeline integration
- REST API for job management
- Web UI for monitoring
Built with ❤️ using Apache Spark and Java