Serverless PDF to DOCX Converter using AWS ECS Fargate

A cloud-native solution for converting PDF files to DOCX format using serverless AWS services and ECS Fargate.

Project Overview

This project demonstrates a scalable PDF conversion pipeline using:

Amazon S3 for file storage
Amazon SQS for message queuing
AWS Lambda for event-driven triggers
ECS Fargate for containerized processing
VPC with isolated networking

Key Features

🚀 Fully Serverless Architecture
🔄 Event-driven processing via S3 triggers
📦 Containerized conversion logic in ECS Fargate
⚡ Auto-scaling infrastructure
🔒 Secure VPC configuration with private subnets
📊 CloudWatch monitoring for logs and metrics

Architecture

```mermaid
graph TD
    A[User] -->|Upload PDF| B[(Amazon S3)]
    
    subgraph AWS["AWS Cloud"]
        subgraph VPC["VPC (10.0.0.0/16)"]
            subgraph PublicSubnet["Public Subnet"]
                I[Internet Gateway]
            end
            
            subgraph PrivateSubnet1["Private Subnet (ECS & Lambda)"]
                H[AWS Lambda]
                D[ECS Fargate Tasks]
                E[VPC Endpoints]
            end
        end
        
        B -->|Event Notification| C{Amazon SQS}
        C -->|Triggers| H
        H -->|Invokes| D
        D -->|Pull PDF| B
        D -->|Store DOCX| B
        D -->|Logs| G[Amazon CloudWatch]
        H -->|Logs| G
    end
```
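
The "Pull PDF" and "Store DOCX" edges in the diagram could be sketched roughly as below. This is a minimal illustration, not the repository's actual container code: the `pdf2docx` library, the `PDF_KEY` variable, and the function names are assumptions, and only `S3_BUCKET` comes from this README.

```python
import os
import posixpath


def docx_key_for(pdf_key: str) -> str:
    """Map an S3 key such as 'in/report.pdf' to 'in/report.docx'."""
    base, ext = posixpath.splitext(pdf_key)
    return (base if ext.lower() == ".pdf" else pdf_key) + ".docx"


def pdf2docx_convert(pdf_path: str, docx_path: str) -> None:
    """Convert one local file. pdf2docx is an assumed choice of library."""
    from pdf2docx import Converter  # third-party, imported lazily

    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path)
    finally:
        converter.close()


def convert_object(bucket, pdf_key, s3_client, convert, workdir="/tmp"):
    """Download one PDF, convert it locally, upload the DOCX, return its key."""
    local_pdf = os.path.join(workdir, posixpath.basename(pdf_key))
    local_docx = os.path.splitext(local_pdf)[0] + ".docx"
    s3_client.download_file(bucket, pdf_key, local_pdf)   # Pull PDF
    convert(local_pdf, local_docx)
    out_key = docx_key_for(pdf_key)
    s3_client.upload_file(local_docx, bucket, out_key)    # Store DOCX
    return out_key


def main() -> None:
    import boto3  # third-party AWS SDK, assumed to be installed in the image

    bucket = os.environ["S3_BUCKET"]  # listed under Configuration
    pdf_key = os.environ["PDF_KEY"]   # hypothetical: set by whatever starts the task
    print(convert_object(bucket, pdf_key, boto3.client("s3"), pdf2docx_convert))
```

Passing the S3 client and the converter in as arguments keeps `convert_object` easy to exercise locally with stand-ins before building the image.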

Workflow

User uploads a PDF to the S3 bucket
S3 event notification sends a message to the SQS queue
SQS message triggers the Lambda function, which starts an ECS Fargate task
ECS Fargate task pulls the PDF from S3 and converts it to DOCX
Converted DOCX file is stored back in S3
Lambda and ECS logs are streamed to CloudWatch
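
The trigger step in this workflow might look like the sketch below: a Lambda handler that unwraps the S3 notification from each SQS record and starts one Fargate task per uploaded PDF. `ECS_CLUSTER` and `TASK_DEFINITION` come from this README. The `SUBNET_IDS`, `SECURITY_GROUP_IDS`, and `PDF_KEY` variables and the `pdf-converter` container name are assumptions made for illustration.

```python
import json
import os
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event: dict) -> list:
    """Return (bucket, key) pairs from S3 notifications wrapped in SQS records."""
    objects = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        # S3 test events ("s3:TestEvent") carry no "Records" list.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (for example, spaces as "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            # Skip non-PDF objects so that stored DOCX files do not re-trigger work.
            if key.lower().endswith(".pdf"):
                objects.append((bucket, key))
    return objects


def build_run_task_request(bucket, key, cluster, task_definition,
                           subnets, security_groups,
                           container_name="pdf-converter"):  # hypothetical name
    """Build keyword arguments for the ECS RunTask call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",  # tasks run in a private subnet
            }
        },
        "overrides": {
            "containerOverrides": [{
                "name": container_name,
                "environment": [
                    {"name": "S3_BUCKET", "value": bucket},
                    {"name": "PDF_KEY", "value": key},  # hypothetical variable
                ],
            }]
        },
    }


def handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    ecs = boto3.client("ecs")
    started = []
    for bucket, key in extract_s3_objects(event):
        request = build_run_task_request(
            bucket, key,
            cluster=os.environ["ECS_CLUSTER"],
            task_definition=os.environ["TASK_DEFINITION"],
            subnets=os.environ["SUBNET_IDS"].split(","),                  # assumed
            security_groups=os.environ["SECURITY_GROUP_IDS"].split(","),  # assumed
        )
        response = ecs.run_task(**request)
        started.extend(task["taskArn"] for task in response.get("tasks", []))
    return {"started": started}
```

Filtering on the `.pdf` suffix matters when converted files go back into the same bucket, since each stored DOCX could otherwise produce another notification.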

Prerequisites

AWS Account with appropriate permissions
AWS CLI v2 installed and configured
Terraform (for infrastructure deployment)
Docker (for container image creation)

Deployment Steps

Clone repository:

```
git clone https://github.com/your-repo/pdf-to-docx-converter.git
cd pdf-to-docx-converter
```

Build Docker image:
```
docker build -t pdf-converter .
```
Initialize Terraform:
```
cd infrastructure
terraform init
```
Deploy infrastructure:
```
terraform apply -auto-approve
```
Upload test PDF to the created S3 bucket
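
One way to do the test upload from a script is sketched below, using boto3's `upload_file`. The bucket name and local path are placeholders; the real bucket name is whatever Terraform created, and the `upload_pdf` helper is hypothetical.

```python
import os


def upload_pdf(local_path, bucket, prefix="", s3_client=None):
    """Upload a local PDF to the bucket and return the object key."""
    if not local_path.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF file: {local_path}")
    if s3_client is None:
        import boto3  # third-party AWS SDK, imported lazily

        s3_client = boto3.client("s3")
    # Join an optional prefix and the file name without a leading slash.
    key = f"{prefix.strip('/')}/{os.path.basename(local_path)}".lstrip("/")
    s3_client.upload_file(
        local_path, bucket, key,
        ExtraArgs={"ContentType": "application/pdf"},
    )
    return key


# Example call (placeholder names):
# upload_pdf("sample.pdf", "your-created-bucket")
```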

Configuration

| Component | Environment Variables |
| --- | --- |
| ECS Task | `S3_BUCKET`, `QUEUE_URL` |
| Lambda Function | `ECS_CLUSTER`, `TASK_DEFINITION` |
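
Each component could read and validate these variables at startup, so that a missing value fails fast with a clear message. The sketch below is an illustration under that assumption; the class and function names are hypothetical and not taken from the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Settings read by the ECS task."""
    s3_bucket: str
    queue_url: str


@dataclass(frozen=True)
class TriggerConfig:
    """Settings read by the Lambda function."""
    ecs_cluster: str
    task_definition: str


def _require(env, name):
    """Return a non-empty variable or raise a descriptive error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


def load_task_config(env=os.environ):
    return TaskConfig(_require(env, "S3_BUCKET"), _require(env, "QUEUE_URL"))


def load_trigger_config(env=os.environ):
    return TriggerConfig(_require(env, "ECS_CLUSTER"), _require(env, "TASK_DEFINITION"))
```

Accepting `env` as a parameter lets the loaders be checked with a plain dictionary instead of the real process environment.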

Monitoring

All services send logs to CloudWatch:

Conversion metrics under `/aws/ecs/pdf-conversion`
Lambda invocation logs at `/aws/lambda/pdf-trigger`
S3 access logs in `s3-access-logs`
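
To check recent output after a test upload, the log groups above can be queried with the CloudWatch Logs `filter_log_events` API. The helper below is a minimal sketch; its name and parameters are hypothetical, and it assumes the log group names listed above exist in your account.

```python
import time


def recent_log_messages(log_group, minutes=15, filter_pattern="",
                        logs_client=None, now=None):
    """Return messages logged to a log group during the last `minutes`."""
    if logs_client is None:
        import boto3  # third-party AWS SDK, imported lazily

        logs_client = boto3.client("logs")
    now = time.time() if now is None else now
    kwargs = {
        "logGroupName": log_group,
        # CloudWatch Logs expects timestamps in milliseconds since the epoch.
        "startTime": int((now - minutes * 60) * 1000),
    }
    if filter_pattern:
        kwargs["filterPattern"] = filter_pattern
    messages = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # fetch the next page of results


# Example call:
# for line in recent_log_messages("/aws/lambda/pdf-trigger", filter_pattern="ERROR"):
#     print(line)
```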

Cleanup

To remove all resources:

```
terraform destroy -auto-approve
```

