Add parallelized-embedding-pipeline #403
base: main
- Enhanced README.md with workflow diagram and detailed architecture
- Added complete example-workflow.json with all required metadata
- Created resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- Included comprehensive deployment guides and resource links
- Ready for AWS Step Functions workflows collection contribution

- Complete document vectorization pipeline implementation
- Enhanced README.md with workflow diagram and comprehensive documentation
- Added example-workflow.json with all required metadata for AWS samples
- Included resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- All Lambda functions, deployment scripts, and configuration files
- Ready for production use and AWS Step Functions workflows collection
## Architecture
It would be great to add a sentence to the overview above explaining how the different layers (Raw, Cleaned, and Curated) fit into this.
### Workflow Diagram

The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:
Suggested change:
- The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:
+ The following diagram illustrates the complete AWS Step Functions workflow for the document vectorization pipeline:
Please use the full service name on each first mention
- **S3 Bucket**: Stores documents in different stages (raw, cleaned, curated)
- **SQS Queue**: Handles document processing events
- **Step Functions**: Orchestrates the document processing workflow
- **Lambda Functions**: Process documents and generate embeddings
- **Aurora PostgreSQL**: Database with pgvector extension for storing vectors |
Add correct service name prefixes
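For the Aurora PostgreSQL component listed above, here is a minimal sketch of how an embedding might be prepared for a pgvector column. It relies only on pgvector's text input format (`[x1,x2,...]`). The table name `document_chunks`, its columns, and both helper functions are hypothetical and are not taken from this pull request.

```python
from __future__ import annotations


def to_pgvector_literal(embedding: list[float]) -> str:
    """Render a list of numbers in pgvector's text format, e.g. "[1.0,0.5]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_insert(source_key: str, embedding: list[float]) -> tuple[str, tuple[str, str]]:
    """Return a parameterized INSERT statement and its parameters.

    Assumption: a table `document_chunks(source_key text, embedding vector)`
    exists. The %s placeholders follow the psycopg parameter style; the
    statement is only built here, not executed.
    """
    sql = (
        "INSERT INTO document_chunks (source_key, embedding) "
        "VALUES (%s, %s::vector)"
    )
    return sql, (source_key, to_pgvector_literal(embedding))
```

Passing the vector as a parameter and casting it with `::vector` keeps the statement parameterized, so the embedding values are never interpolated into the SQL string.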
4. **To test database connectivity:** Use `./test-connection.sh`
5. **To test full pipeline functionality:** Use `./test-functionality.sh`

### Deployment Scripts
Is this table really necessary given the section above?
1. VPC with public and private subnets
2. Aurora PostgreSQL database with pgvector support
3. Lambda functions for document processing
4. Step Functions for workflow orchestration |
Suggested change:
- 4. Step Functions for workflow orchestration
+ 4. Step Functions state machine for workflow orchestration
Properties:
  CodeUri: functions/GetByteRangesFunction
  Handler: app.lambda_handler
  Runtime: python3.9
See above; this applies to the other functions too.
Properties:
  LayerName: shared-libraries
  Description: Common libraries and secure utilities for document processing
  ContentUri: functions/shared
Is this missing?
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSVPCFlowLogsRole

VPCFlowLogs:
Consider removing this for a demo
Value: !Sub ${AWS::StackName}-vpc-flow-logs

# Enhance RDS Cluster Configuration
VectorDBCluster:
Didn't you already define this above?
PreferredBackupWindow: "03:00-04:00"
PreferredMaintenanceWindow: "mon:04:00-mon:05:00"

VectorDBInstance:
Didn't you already define this above?
This project contains a document vectorization pipeline built on AWS services. It processes text, PDF, and Word documents, extracts their content, generates vector embeddings in parallel, and stores them in a PostgreSQL database optimized for vector search.
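As a rough illustration of the "embeddings in parallel" idea described above, the sketch below splits a document into byte ranges and embeds each chunk concurrently. It is not the code from this pull request: the function names are hypothetical, the behavior assumed for a byte-range step is a guess based only on the name `GetByteRangesFunction`, and `fake_embedding` is a stand-in for a real embedding model call.

```python
from __future__ import annotations

import hashlib
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into inclusive (start, end) byte ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def fake_embedding(chunk: bytes, dims: int = 8) -> list[float]:
    """Placeholder embedding: values derived from a SHA-256 digest.

    Assumption: a real pipeline would call an embedding model here.
    """
    digest = hashlib.sha256(chunk).digest()
    return [b / 255.0 for b in digest[:dims]]


def embed_document(
    data: bytes, chunk_size: int = 1024, workers: int = 4
) -> list[list[float]]:
    """Embed each byte-range chunk in parallel, preserving chunk order."""
    chunks = [data[start : end + 1] for start, end in byte_ranges(len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(fake_embedding, chunks))
```

In a Step Functions deployment, the per-chunk work would more plausibly be fanned out by a Map state invoking a Lambda function per range. The thread pool here only mimics that fan-out locally.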
Thank you
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.