PDF Processing AWS Infrastructure

This project builds an AWS infrastructure using AWS CDK (Cloud Development Kit) to split a PDF into chunks, process the chunks via AWS Step Functions, and merge the resulting chunks back using ECS tasks. The infrastructure also includes monitoring via CloudWatch dashboards and metrics for tracking progress.

Prerequisites

Before running the AWS CDK stack, ensure the following are installed and configured:

Amazon Bedrock Access: Ensure your AWS account has access to the Claude 3.5 Sonnet v2 model in Amazon Bedrock.
- Request access to Amazon Bedrock through the AWS console if not already enabled.
Adobe API Access: An enterprise-level contract or a trial account (for testing) with Adobe's API is required.
- Sign up for the Adobe PDF Services API to obtain API credentials.
Python (3.7 or later)
- Download Python
- Set up a virtual environment
```
python -m venv .venv
source .venv/bin/activate  # For macOS/Linux
.venv\Scripts\activate     # For Windows
```
- On Windows, confirm the Python path in cmd before deploying by running:
```
where python
```
AWS CLI: To interact with AWS services and set up credentials.
- Install AWS CLI
npm
- npm is required to install AWS CDK. Install npm by installing Node.js:
  - Download Node.js (includes npm).
- Verify npm installation:
```
npm --version
```
AWS CDK: For defining cloud infrastructure in code.
- Install AWS CDK
```
npm install -g aws-cdk
```
Docker: Required to build and run Docker images for the ECS tasks.
- Install Docker
- Verify installation:
```
docker --version
```
AWS Account Permissions
- Ensure permissions to create and manage AWS resources like S3, Lambda, ECS, ECR, Step Functions, and CloudWatch.
- AWS IAM Policies and Permissions
- For ease of deployment, create an IAM user in the account you want to deploy to, attach administrator access to that user, and use that user's access key and secret key.
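
The prerequisite checks above can be scripted. The following is a minimal sketch, not part of the project: it assumes the tools are exposed on PATH under the command names aws, npm, cdk, and docker, and the function names are illustrative.

```python
import shutil
import sys

# Command names assumed from the prerequisites above; adjust if yours differ.
REQUIRED_TOOLS = ["aws", "npm", "cdk", "docker"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the 3.7+ requirement."""
    return tuple(version_info[:2]) >= (3, 7)


def report():
    """Print one line per missing prerequisite; print nothing if all are found."""
    if not python_is_supported():
        print("Python 3.7 or later is required; found", sys.version.split()[0])
    for tool in find_missing_tools(REQUIRED_TOOLS):
        print("Not found on PATH:", tool)
```

Calling report() inside the activated virtual environment lists anything that still needs to be installed. It only checks that the commands exist, not that they are configured.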

Directory Structure

Ensure your project has the following structure:

├── app.py (Main CDK app)
├── lambda/
│   ├── split_pdf/ (Python Lambda for splitting PDF)
│   └── java_lambda/ (Java Lambda for merging PDFs)
├── docker_autotag/ (Python Docker image for ECS task)
├── javascript_docker/ (JavaScript Docker image for ECS task)
└── client_credentials.json (Adobe client ID and client secret)
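
To confirm the layout before deploying, a short script can compare the working directory against the tree above. This is a sketch under the assumption that the entries listed in the tree are the only ones that matter; the helper name is hypothetical.

```python
from pathlib import Path

# Entries taken from the tree above; a trailing "/" marks a directory.
EXPECTED_ENTRIES = [
    "app.py",
    "lambda/split_pdf/",
    "lambda/java_lambda/",
    "docker_autotag/",
    "javascript_docker/",
    "client_credentials.json",
]


def missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    for entry in expected:
        path = root / entry
        # Directories and files are checked separately so a file named
        # like a directory does not pass by accident.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Running missing_entries(".") from the project root returns an empty list when everything is in place.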

Setup and Deployment

Clone the Repository:
- Clone this repository containing the CDK code, Docker configurations, and Lambda functions.
Set Up Your Environment:
- Configure AWS CLI with your AWS account credentials:
```
aws configure
```
- Make sure the default region is set to:
```
us-east-1
```
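
The region saved by aws configure can be read back from the AWS CLI config file, which uses INI syntax. The sketch below assumes the default file location (.aws/config in the home directory) and ignores environment variables such as AWS_REGION that can override it; the function names are illustrative.

```python
import configparser
from pathlib import Path

EXPECTED_REGION = "us-east-1"


def default_config_path():
    """Return the usual location of the AWS CLI config file."""
    return Path.home() / ".aws" / "config"


def configured_region(config_path, profile="default"):
    """Return the region stored for `profile` in the config file, or None."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # Named profiles are stored as "[profile name]" sections.
    section = "default" if profile == "default" else f"profile {profile}"
    if not parser.has_section(section):
        return None
    return parser.get(section, "region", fallback=None)
```

Comparing configured_region(default_config_path()) with EXPECTED_REGION shows whether the default profile points at the right region.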
Set Up CDK Environment:
- Bootstrap your AWS environment for CDK (run only once per AWS account/region):
```
cdk bootstrap
```
Create Adobe API Credentials:
- Create a file called client_credentials.json in the root directory with the following structure:
```
{
  "client_credentials": {
    "PDF_SERVICES_CLIENT_ID": "<Your client ID here>",
    "PDF_SERVICES_CLIENT_SECRET": "<Your secret ID here>"
  }
}
```
- Replace only the placeholders "<Your client ID here>" and "<Your secret ID here>" with the Client ID and Client Secret provided by Adobe; keep the rest of the file as shown.
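
Before uploading the file, it can help to check that it has the structure shown above and that the placeholders were replaced. The following is a minimal sketch; the function name and messages are illustrative and not part of the project.

```python
import json

REQUIRED_KEYS = ("PDF_SERVICES_CLIENT_ID", "PDF_SERVICES_CLIENT_SECRET")


def credential_problems(text):
    """Return a list of problems found in the contents of client_credentials.json."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    creds = data.get("client_credentials") if isinstance(data, dict) else None
    if not isinstance(creds, dict):
        return ['missing top-level "client_credentials" object']
    problems = []
    for key in REQUIRED_KEYS:
        value = creds.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} is missing or empty")
        elif value.startswith("<") and value.endswith(">"):
            # The template placeholders are wrapped in angle brackets.
            problems.append(f"{key} still contains the placeholder text")
    return problems
```

An empty list means the file is structurally sound; it does not verify that Adobe accepts the credentials.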

Upload Credentials to Secrets Manager:

Run this command in a terminal at the project root to push the credentials to AWS Secrets Manager:

For Mac

aws secretsmanager create-secret \
    --name /myapp/client_credentials \
    --description "Client credentials for PDF services" \
    --secret-string file://client_credentials.json

For Windows

aws secretsmanager create-secret --name /myapp/client_credentials --description "Client credentials for PDF services" --secret-string file://client_credentials.json

If you have already uploaded the credentials and want to update them in Secrets Manager, run this command instead:

For Mac:

aws secretsmanager update-secret \
    --secret-id /myapp/client_credentials \
    --description "Updated client credentials for PDF services" \
    --secret-string file://client_credentials.json

For Windows:

aws secretsmanager update-secret --secret-id /myapp/client_credentials --description "Updated client credentials for PDF services" --secret-string file://client_credentials.json
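
The create and update commands above can be combined into one step with boto3, which avoids having to remember whether the secret already exists. This is a sketch, not the project's own tooling: it assumes boto3 is installed, and store_secret is a hypothetical helper that takes a client created with boto3.client("secretsmanager").

```python
def store_secret(client, name, secret_string,
                 description="Client credentials for PDF services"):
    """Create the secret, or update it if one with this name already exists.

    `client` is expected to behave like boto3.client("secretsmanager").
    Returns "created" or "updated".
    """
    try:
        client.create_secret(
            Name=name, Description=description, SecretString=secret_string
        )
        return "created"
    except client.exceptions.ResourceExistsException:
        client.update_secret(
            SecretId=name, Description=description, SecretString=secret_string
        )
        return "updated"


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   client = boto3.client("secretsmanager", region_name="us-east-1")
#   with open("client_credentials.json") as f:
#       store_secret(client, "/myapp/client_credentials", f.read())
```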

Install the Requirements:
- For both Mac and Windows:
```
pip install -r requirements.txt
```

Connect to ECR:

Ensure Docker Desktop is running, then execute:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
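
The <ACCOUNT_ID> placeholder in the command above is the 12-digit ID of the AWS account you deploy to. As a small sketch, the registry host name can be assembled as follows; ecr_registry is a hypothetical helper, and the account ID itself could be looked up with boto3.client("sts").get_caller_identity()["Account"] if boto3 is installed.

```python
def ecr_registry(account_id, region="us-east-1"):
    """Return the ECR registry host name for an account and region.

    Pass the account ID as a string so leading zeros are preserved.
    """
    account_id = str(account_id)
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"
```

The returned value is the last argument of the docker login command.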

Set an environment variable before deployment:
- Set the following environment variable once before deploying. This step ensures compatibility and prevents deployment issues.
- For additional guidance, or if you encounter any deployment issues, refer to the Troubleshooting section.
- For Mac:
```
export BUILDX_NO_DEFAULT_ATTESTATIONS=1
```
- For Windows:
```
set BUILDX_NO_DEFAULT_ATTESTATIONS=1
```
Deploy the CDK Stack:

Deploy the stack to AWS:
```
cdk deploy
```
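
If you prefer not to rely on the shell session still having BUILDX_NO_DEFAULT_ATTESTATIONS set, the variable can be passed to the deploy command directly. The sketch below is one way to do that with the standard library; the function names are illustrative, and on Windows the cdk command may need to be given as cdk.cmd.

```python
import os
import subprocess


def deploy_environment(base_env=None):
    """Return a copy of the environment with the Buildx attestation flag set."""
    env = dict(os.environ if base_env is None else base_env)
    env["BUILDX_NO_DEFAULT_ATTESTATIONS"] = "1"
    return env


def deploy(command=("cdk", "deploy")):
    """Run the deploy command with the flag set; raises if it exits non-zero."""
    subprocess.run(list(command), env=deploy_environment(), check=True)
```

Calling deploy() from the project root is then equivalent to exporting the variable and running cdk deploy by hand.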

Usage

Once the infrastructure is deployed:

Create a pdf/ folder in the S3 bucket created by the CDK stack.
Upload a PDF file to the pdf/ folder in the S3 bucket.
The process will automatically trigger and start processing the PDF.
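
The upload can also be done from a script. S3 has no real folders, so uploading an object whose key starts with pdf/ creates the folder implicitly. The following is a sketch that assumes boto3 is installed; the helper names are hypothetical, and the bucket name must be the one created by your deployment.

```python
from pathlib import Path


def pdf_object_key(local_path, prefix="pdf/"):
    """Return the S3 key that places the file inside the pdf/ folder."""
    path = Path(local_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF file: {path.name}")
    return prefix + path.name


def upload_pdf(s3_client, bucket, local_path):
    """Upload the file under pdf/ and return the key that was used.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    key = pdf_object_key(local_path)
    s3_client.upload_file(str(local_path), bucket, key)
    return key


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_pdf(boto3.client("s3"), "<your-bucket-name>", "report.pdf")
```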

Monitoring

Use the CloudWatch dashboards created by the stack to monitor the progress and performance of the PDF processing pipeline.
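
The dashboard names are not listed in this document, so one way to find them is to list every dashboard in the account and region. This sketch assumes boto3 is installed and that the client is created with boto3.client("cloudwatch"); dashboard_names is a hypothetical helper.

```python
def dashboard_names(cloudwatch_client):
    """Return the names of all CloudWatch dashboards visible to the client."""
    names = []
    kwargs = {}
    while True:
        page = cloudwatch_client.list_dashboards(**kwargs)
        names.extend(e["DashboardName"] for e in page.get("DashboardEntries", []))
        # Keep requesting pages until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token
```

The dashboards created by the stack should appear in the returned list after a successful deployment.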

Limitations

This solution does not remediate corrupted PDFs.
It can process scanned PDFs, but the output accuracy is approximately 80%.
It does not remediate fillable forms.
It does not handle color selection/contrast adjustments.

Troubleshooting

If you encounter any issues during setup or deployment, please check the following:

Ensure all prerequisites are correctly installed and configured.
Verify that your AWS credentials have the necessary permissions.
Check CloudWatch logs for any error messages in the Lambda functions or ECS tasks.
If cdk deploy fails with "Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases. Subprocess exited with error 9009", change "app": "python3 app.py" to "app": "python app.py" in the cdk.json file.
If cdk deploy fails with: Resource handler returned message: "The maximum number of addresses has been reached.", request additional Elastic IPs from AWS. Go to https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas and search for "IP", then choose "EC2-VPC Elastic IPs". The AWS region is part of the URL; change it to the region you are deploying into. Requests for additional IPs are usually completed within minutes.
If any Docker images are not pushing to ECR, manually deploy to ECR using the push commands provided in the ECR console. Then, manually update the ECS service by creating a new revision of the task definition and updating the image URI with the one just deployed. For further assistance, please open an issue in this repository.
If you encounter issues with the 9th step, refer to the related discussion on the AWS CDK GitHub repository for further troubleshooting: CDK Github Issue. You can also consult our Troubleshooting CDK Deploy documentation for more detailed guidance.
If you continue to experience issues, please reach out to ai-cic@amazon.com for further assistance.
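
The cdk.json change described above for the "Python was not found" error can be made by hand or with a few lines of Python. This is a sketch with a hypothetical helper name; it only touches the "app" entry and leaves every other key as it was, although the file is re-serialized with two-space indentation.

```python
import json


def use_python_command(cdk_json_text):
    """Return cdk.json content with "python3 app.py" replaced by "python app.py"."""
    config = json.loads(cdk_json_text)
    if config.get("app") == "python3 app.py":
        config["app"] = "python app.py"
    return json.dumps(config, indent=2) + "\n"
```

Read cdk.json, pass its text to use_python_command, and write the result back to the same file.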

Additional Resources

For more details on the problem approach, industry impact, and our innovative solution developed by ASU CIC, please visit our blog: PDF Accessibility Blog

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes.

Release Notes

See the latest Release Notes for version updates and improvements.
