Add parallelized-embedding-pipeline #403

solaws · 2025-08-28T16:48:57Z

This project contains a document vectorization pipeline built on AWS services. It processes text, PDF, and Word documents, extracts their content, generates vector embeddings in parallel, and stores them in a PostgreSQL database optimized for vector search.
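To make the "embeddings in parallel" step concrete, here is a minimal local sketch, not the code in this PR. Everything in it is an assumption for illustration: `chunk_text`, `fake_embed`, `embed_in_parallel`, and `to_pgvector_literal` are hypothetical names, the hash-based `fake_embed` stands in for a real embedding model call, and the thread pool stands in for whatever fan-out the workflow itself performs. The only outside fact relied on is that pgvector accepts vectors in the bracketed text form `[x1,x2,...]`.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def chunk_text(text, size=200):
    """Split text into fixed-size character chunks (hypothetical chunking rule)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunk, dim=8):
    """Stand-in for a real embedding call: a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def embed_in_parallel(chunks, max_workers=4):
    """Embed chunks concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_embed, chunks))


def to_pgvector_literal(vec):
    """Format a vector in the bracketed text form that pgvector accepts."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


if __name__ == "__main__":
    chunks = chunk_text("example document text " * 30)
    for chunk, vec in zip(chunks, embed_in_parallel(chunks)):
        print(len(chunk), to_pgvector_literal(vec))
```

Because results come back in input order, each vector can be paired with its source chunk before being written to the database.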

Thank you

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Enhanced README.md with workflow diagram and detailed architecture
- Added complete example-workflow.json with all required metadata
- Created resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- Included comprehensive deployment guides and resource links
- Ready for AWS Step Functions workflows collection contribution

- Complete document vectorization pipeline implementation
- Enhanced README.md with workflow diagram and comprehensive documentation
- Added example-workflow.json with all required metadata for AWS samples
- Included resources folder with workflow diagram and author photos
- Added professional author information for Solomon Ojo and Dave Horne
- All Lambda functions, deployment scripts, and configuration files
- Ready for production use and AWS Step Functions workflows collection

bfreiberg · 2025-09-18T19:32:51Z

parallelized-embedding-pipeline/README.md

+
+## Architecture
+
+![Architecture Diagram](./resources/architecture.png)


It would be great to add a sentence to the overview above explaining how the different "layers" (Raw, Cleaned, and Curated) fit into this.

bfreiberg · 2025-09-18T19:34:31Z

parallelized-embedding-pipeline/README.md

+
+### Workflow Diagram
+
+The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:


Suggested change

The following diagram illustrates the complete Step Functions workflow for the document vectorization pipeline:

The following diagram illustrates the complete AWS Step Functions workflow for the document vectorization pipeline:

Please use the full service name on first mention of each service.

bfreiberg · 2025-09-18T19:35:15Z

parallelized-embedding-pipeline/README.md

+- **S3 Bucket**: Stores documents in different stages (raw, cleaned, curated)
+- **SQS Queue**: Handles document processing events
+- **Step Functions**: Orchestrates the document processing workflow
+- **Lambda Functions**: Process documents and generate embeddings
+- **Aurora PostgreSQL**: Database with pgvector extension for storing vectors


Please add the correct service name prefixes (AWS or Amazon) to each service in this list.

bfreiberg · 2025-09-18T19:36:37Z

parallelized-embedding-pipeline/README.md

+4. **To test database connectivity:** Use `./test-connection.sh`
+5. **To test full pipeline functionality:** Use `./test-functionality.sh`
+
+### Deployment Scripts


Is this table really necessary given the section above?

bfreiberg · 2025-09-18T19:37:03Z

parallelized-embedding-pipeline/README.md

+1. VPC with public and private subnets
+2. Aurora PostgreSQL database with pgvector support
+3. Lambda functions for document processing
+4. Step Functions for workflow orchestration


Suggested change

4. Step Functions for workflow orchestration

4. Step Functions state machine for workflow orchestration

bfreiberg · 2025-09-18T19:53:53Z