A Proof of Concept (POC) demonstrating the use of AWS Lambda with DuckDB as an alternative to traditional Spark-based data pipelines.
This project implements a serverless ETL pipeline using AWS Lambda and DuckDB to process data files stored in S3. It showcases how lightweight, cost-effective serverless functions can replace heavier Spark infrastructure for certain data processing workloads.
The solution follows a lakehouse architecture pattern with three distinct data zones:
```
S3 Raw Bucket → Lambda (DuckDB) → S3 Silver/Gold Buckets
```
- AWS Lambda: Serverless compute running containerized DuckDB processing
- DuckDB: Embedded analytical database for data transformation
- Amazon S3: Three-tier lakehouse storage (raw, silver, gold)
- Amazon ECR: Container registry for Lambda Docker images
- Amazon CloudWatch: Logging and monitoring
- Data files land in the `raw` S3 bucket
- An S3 event notification triggers the Lambda function
- Lambda processes the data using DuckDB
- Transformed data is written to the `silver` or `gold` buckets
The infrastructure is defined using Terraform and consists of the following components:
Three S3 buckets implementing a lakehouse pattern:
- `raw-{account_id}`: Landing zone for raw ingested data
- `silver-{account_id}`: Cleaned and validated data
- `gold-{account_id}`: Business-ready aggregated data
Each bucket includes:
- Private ACL for security
- Server-side encryption (AES256)
- S3 event notifications (raw bucket triggers Lambda)
- Function Name: `duckdb_ingestion` (configurable)
- Package Type: Container image from ECR
- Trigger: S3 `ObjectCreated` events on the raw bucket (sample event shown after this list)
- Runtime: Python 3.12 with DuckDB
- Handler: `main.lambda_handler`
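For reference, the S3 notification that invokes the function nests the bucket and key a few levels deep. A trimmed example of the payload (the bucket name, key, and size here are illustrative; only the fields the handler reads are shown):

```python
# Abridged S3 ObjectCreated event as delivered to the Lambda handler.
s3_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "raw-123456789012"},
                "object": {"key": "data.csv", "size": 1024},
            },
        }
    ]
}
```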
- ECR Repository: `lambda_duckdb`
- Automated Build: Terraform triggers the Docker build and push
- Rebuild Triggers: Changes to Dockerfile, main.py, or requirements.txt
- Security: Image scanning enabled on push
Lambda execution role with permissions for:
- S3 read/write access to buckets
- CloudWatch Logs for monitoring
- EC2 network interface management (only needed for optional VPC deployment)
Dependencies (requirements.txt:3-22):
- `boto3`: AWS SDK for Python
- `duckdb`: Embedded analytical database
- Supporting libraries (`botocore`, `s3transfer`, etc.)
Lambda Handler (main.py:9-25):
- Extracts S3 bucket and key from event
- Initializes DuckDB connection
- Configures S3 access via the `httpfs` extension
- Uses the AWS credential chain for authentication
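Putting those steps together, a minimal handler might look like the sketch below. This is an illustration, not the POC's exact code: the CSV-to-Parquet transformation and the raw-to-silver bucket naming are assumptions.

```python
from urllib.parse import unquote_plus

import duckdb


def lambda_handler(event, context):
    # Extract the bucket and (URL-encoded) object key from the S3 event
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    con = duckdb.connect(":memory:")

    # Lambda's filesystem is read-only except /tmp, so point DuckDB's
    # home (extension) directory there before installing extensions.
    # In practice the extensions can also be baked into the container image.
    con.execute("SET home_directory = '/tmp';")
    for ext in ("httpfs", "aws"):
        con.execute(f"INSTALL {ext};")
        con.execute(f"LOAD {ext};")

    # Authenticate to S3 via the AWS credential chain
    # (the Lambda execution role at runtime)
    con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain);")

    # Illustrative transformation: read the raw CSV, write Parquet to the
    # silver zone (bucket naming assumed from the lakehouse layout above)
    silver_bucket = bucket.replace("raw-", "silver-")
    con.execute(f"""
        COPY (SELECT * FROM read_csv_auto('s3://{bucket}/{key}'))
        TO 's3://{silver_bucket}/{key}.parquet' (FORMAT PARQUET);
    """)

    return {"status": "ok", "source": f"s3://{bucket}/{key}"}
```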
- AWS CLI configured with appropriate credentials
- Terraform >= 1.0
- Docker installed and running
- AWS account with permissions to create:
- Lambda functions
- S3 buckets
- ECR repositories
- IAM roles and policies
| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-east-1` | AWS region for resource deployment |
| `lambda_name` | `duckdb_ingestion` | Name of the Lambda function |
- Navigate to the infrastructure directory: `cd infra`
- Initialize Terraform: `terraform init`
- Review the planned changes: `terraform plan`
- Deploy the infrastructure: `terraform apply`

The deployment process will:
- Create S3 buckets for raw, silver, and gold zones
- Set up ECR repository
- Build and push Docker image to ECR
- Deploy Lambda function with S3 trigger
- Configure IAM roles and permissions
Once deployed, the pipeline activates automatically:
- Upload a data file to the raw bucket: `aws s3 cp data.csv s3://raw-{account_id}/`
- The Lambda function is triggered automatically
- View logs in CloudWatch: `aws logs tail /aws/lambda/duckdb_ingestion --follow`
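Before (or instead of) a full deployment, the handler can also be smoke-tested locally with a synthetic event matching the shape shown earlier. A convenience sketch, assuming `main.py` is importable and local AWS credentials can reach the buckets (the bucket name is illustrative):

```python
# Local smoke test: call the handler directly with a synthetic S3 event.
from main import lambda_handler

fake_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "raw-123456789012"},  # illustrative
                "object": {"key": "data.csv"},
            }
        }
    ]
}

print(lambda_handler(fake_event, None))
```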
This architecture offers several cost advantages over Spark:

- No Idle Resources: Lambda charges only for execution time
- No Cluster Management: Eliminates EMR/Glue/Databricks costs
- Efficient Processing: DuckDB's columnar engine is highly optimized
- Scalability: Automatic scaling without pre-provisioning
Consider these limitations when using Lambda + DuckDB:
- Execution Time: 15-minute maximum Lambda timeout
- Memory: Up to 10 GB of Lambda memory
- Storage: Up to 10 GB of ephemeral storage in `/tmp` (512 MB by default)
- Best For: Small to medium datasets (roughly 1-2 GB per invocation)
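Within those limits it can help to pin DuckDB's resource usage explicitly. A small sketch, assuming a 10 GB Lambda; the exact values depend on your function's configuration:

```python
import duckdb

con = duckdb.connect(":memory:")

# Keep DuckDB's spill files inside Lambda's only writable path
con.execute("SET temp_directory = '/tmp/duckdb_spill';")

# Cap DuckDB's memory below the function's configured limit to leave
# headroom for the Python runtime (value assumes a 10 GB Lambda)
con.execute("SET memory_limit = '8GB';")
```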
| Aspect | Lambda + DuckDB | Spark |
|---|---|---|
| Setup Time | Seconds | Minutes |
| Cost Model | Pay-per-execution | Pay-per-cluster-hour |
| Cluster Management | None | Required |
| Data Size | Small-Medium (<2GB) | Any size |
| Processing Speed | Fast for small data | Fast for large data |
```
.
├── infra/
│   ├── app/
│   │   ├── Dockerfile         # Lambda container definition
│   │   ├── main.py            # Lambda handler code
│   │   ├── requirements.txt   # Python dependencies
│   │   └── pyproject.toml     # Python project metadata
│   ├── main.tf                # Terraform provider config
│   ├── variables.tf           # Input variables
│   ├── data.tf                # Data sources
│   ├── s3.tf                  # S3 bucket resources
│   ├── lambda.tf              # Lambda function
│   ├── ecr.tf                 # Container registry
│   └── iam.tf                 # IAM roles and policies
├── .gitignore
└── README.md
```
To destroy all created resources:
```
cd infra
terraform destroy
```

Note: Ensure the S3 buckets are empty before destroying, or set `force_destroy = true` on the bucket resources.
Potential improvements for this POC:
- Implement actual data transformation logic in Lambda
- Add Lambda environment variables for configuration
- Implement DLQ (Dead Letter Queue) for failed executions
- Add VPC configuration for private networking
- Implement CloudWatch alarms for monitoring
- Add Lambda layers for shared dependencies
- Implement Step Functions for complex workflows
- Add data quality checks and validation
- Implement incremental processing patterns
This is a POC project for evaluation purposes.