
@solaws solaws commented Aug 28, 2025

This project contains a document vectorization pipeline built on AWS services. It processes text, PDF, and Word documents, extracts their content, generates vector embeddings in parallel, and stores them in a PostgreSQL database optimized for vector search.

Thank you

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

soojo added 2 commits August 28, 2025 12:41
- Enhanced README.md with workflow diagram and detailed architecture
- Added complete example-workflow.json with all required metadata
- Created resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- Included comprehensive deployment guides and resource links
- Ready for AWS Step Functions workflows collection contribution
@solaws solaws closed this Sep 11, 2025
@solaws solaws reopened this Sep 11, 2025
- Complete document vectorization pipeline implementation
- Enhanced README.md with workflow diagram and comprehensive documentation
- Added example-workflow.json with all required metadata for AWS samples
- Included resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- All Lambda functions, deployment scripts, and configuration files
- Ready for production use and AWS Step Functions workflows collection

## Architecture

![Architecture Diagram](./resources/architecture.png)
Contributor

It would be great to add a sentence to the overview above explaining how the different "layers" (Raw, Cleaned, and Curated) fit into this.


### Workflow Diagram

The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:
Contributor

Suggested change
- The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:
+ The following diagram illustrates the complete AWS Step Functions workflow for the document vectorization pipeline:

Contributor

Please use the full service name the first time each service is mentioned.

Comment on lines +55 to +59
- **S3 Bucket**: Stores documents in different stages (raw, cleaned, curated)
- **SQS Queue**: Handles document processing events
- **Step Functions**: Orchestrates the document processing workflow
- **Lambda Functions**: Process documents and generate embeddings
- **Aurora PostgreSQL**: Database with pgvector extension for storing vectors
Contributor

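Add the correct service name prefixes (for example, Amazon S3, Amazon SQS, AWS Step Functions, AWS Lambda, Amazon Aurora PostgreSQL).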
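
The component list under review implies that S3 object events reach the pipeline through the SQS queue. As a rough illustration only, the sketch below unpacks an SQS batch whose message bodies are S3 event notifications and tags each object with its stage. The function name `extract_s3_objects` and the assumption that the stage is the first path segment of the object key are hypothetical; the PR only names the stages (raw, cleaned, curated) and does not show how they are laid out.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical stage prefixes; the PR names the stages but not the key layout.
STAGES = ("raw", "cleaned", "curated")


def extract_s3_objects(sqs_event):
    """Return (bucket, key, stage) for every S3 object referenced in an SQS batch."""
    objects = []
    for message in sqs_event.get("Records", []):
        # Each SQS message body is a JSON string holding an S3 event notification.
        notification = json.loads(message["body"])
        # S3 test events carry no "Records" list, so default to an empty one.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 URL-encodes object keys in event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            prefix = key.split("/", 1)[0]
            stage = prefix if prefix in STAGES else None
            objects.append((bucket, key, stage))
    return objects


if __name__ == "__main__":
    body = json.dumps(
        {"Records": [{"s3": {"bucket": {"name": "docs"},
                             "object": {"key": "raw/my+report.pdf"}}}]}
    )
    print(extract_s3_objects({"Records": [{"body": body}]}))
```

Returning `None` for an unknown prefix, rather than raising, lets the caller decide whether objects outside the three stages should be skipped or reported.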

4. **To test database connectivity:** Use `./test-connection.sh`
5. **To test full pipeline functionality:** Use `./test-functionality.sh`

### Deployment Scripts
Contributor

Is this table really necessary given the section above?

1. VPC with public and private subnets
2. Aurora PostgreSQL database with pgvector support
3. Lambda functions for document processing
4. Step Functions for workflow orchestration
Contributor

Suggested change
- 4. Step Functions for workflow orchestration
+ 4. Step Functions state machine for workflow orchestration
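
For item 2 in the list above (Aurora PostgreSQL with pgvector support), the following is a minimal sketch of how embeddings might be stored and queried. The table name `document_chunks`, its columns, and the three-dimensional embedding size are assumptions chosen for illustration and do not come from the PR. The SQL uses pgvector's `vector` type and its `<=>` cosine distance operator, with `%s` placeholders in the style of psycopg.

```python
# Hypothetical embedding size; a real model would typically use hundreds
# or thousands of dimensions.
EMBEDDING_DIM = 3

# Hypothetical schema: one row per embedded chunk of a source document.
CREATE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS document_chunks (
    id bigserial PRIMARY KEY,
    source_key text NOT NULL,
    content text NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
"""

INSERT_SQL = (
    "INSERT INTO document_chunks (source_key, content, embedding) "
    "VALUES (%s, %s, %s::vector)"
)

# Nearest chunks by cosine distance to a query embedding.
SEARCH_SQL = (
    "SELECT source_key, content FROM document_chunks "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)


def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal such as '[1.0,0.5]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


if __name__ == "__main__":
    print(to_vector_literal([1, 0.5, -2]))
```

Passing the embedding as a text literal and casting it with `::vector` keeps the sketch independent of any driver-specific pgvector adapter.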

Properties:
  CodeUri: functions/GetByteRangesFunction
  Handler: app.lambda_handler
  Runtime: python3.9
Contributor

See above; this applies to the other functions too.
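
The PR does not show the body of `GetByteRangesFunction`, so the following is only a guess at what a handler behind `app.lambda_handler` with that name could look like: it splits an object into fixed-size byte ranges so that later steps can fetch and process the pieces in parallel. The event fields `object_size` and `chunk_size`, the 1 MiB default, and the helper `byte_ranges` are all hypothetical.

```python
def byte_ranges(object_size, chunk_size):
    """Split [0, object_size) into inclusive (start, end) byte ranges."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, object_size) - 1)
        for start in range(0, object_size, chunk_size)
    ]


def lambda_handler(event, context):
    # "object_size" and "chunk_size" are hypothetical input fields.
    size = int(event["object_size"])
    chunk = int(event.get("chunk_size", 1024 * 1024))
    # Format each range as an HTTP Range header value (end offset is inclusive).
    return {"ranges": [f"bytes={start}-{end}" for start, end in byte_ranges(size, chunk)]}


if __name__ == "__main__":
    print(lambda_handler({"object_size": 10, "chunk_size": 4}, None))
```

Each returned string can be passed as the `Range` argument of an S3 `GetObject` request, which accepts standard HTTP byte-range values.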

Properties:
  LayerName: shared-libraries
  Description: Common libraries and secure utilities for document processing
  ContentUri: functions/shared
Contributor

Is this missing?

ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSVPCFlowLogsRole

VPCFlowLogs:
Contributor

Consider removing this for a demo.

Value: !Sub ${AWS::StackName}-vpc-flow-logs

# Enhance RDS Cluster Configuration
VectorDBCluster:
Contributor

Didn't you already define this above?

PreferredBackupWindow: "03:00-04:00"
PreferredMaintenanceWindow: "mon:04:00-mon:05:00"

VectorDBInstance:
Contributor

Didn't you already define this above?
