Implemented issue #42 #105
Sorry for the confusion earlier. I've closed the PR against main and opened a new one targeting the dev branch, as advised. Thanks for the clarification @pradeeban
Problem
The current search pipeline relies exclusively on uploader-provided metadata. These metadata fields are vectorized and used for similarity search against data buyer queries.
Because search is limited to metadata alone, results do not always reflect the actual contents of uploaded files. This reduces relevance, robustness, and the overall quality of semantic discovery for data buyers.
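To make the limitation concrete, here is a minimal sketch of a metadata-only flow with ChromaDB. The collection name matches the one introduced below, but the document text and query are purely illustrative and are not code from python_backend/main.py.

```python
# Minimal sketch of the current metadata-only flow; identifiers and sample
# data are illustrative, not the actual code in python_backend/main.py.
import chromadb

client = chromadb.Client()
metadata_collection = client.get_or_create_collection("document_metadata")

# Only the uploader-provided description is embedded and indexed.
metadata_collection.add(
    ids=["doc-1"],
    documents=["De-identified hospital readmission rates, 2019-2023"],
)

# A buyer query is matched against metadata text only; the file's actual
# contents never influence the ranking.
results = metadata_collection.query(query_texts=["ICU length of stay"], n_results=1)
```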
Solution
This PR introduces content aware indexing and retrieval by extracting and indexing actual file content in addition to uploader metadata.
Key improvements include:
Content Extraction
Data is extracted directly from uploaded files such as Excel and CSV during the anonymization stage.
Dual Vector Storage
Metadata and extracted content are stored in separate ChromaDB collections:
document_metadata
document_content
Weighted Search
A new /search_enhanced endpoint combines similarity scores from both collections using configurable weights.
The default weighting is 60 percent content and 40 percent metadata (a sketch of the combination follows this list).
Improved Relevance
Search results are now ranked using real file content rather than relying only on uploader summaries.
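As referenced above, the following is a minimal sketch of how the weighted combination could work, assuming ChromaDB returns cosine distances that are converted to similarities. The function name, the distance-to-similarity conversion, and the score aggregation are assumptions rather than the exact code behind /search_enhanced.

```python
# Illustrative sketch of weighted ranking across the two collections; names
# and scoring details are assumptions, not the code in python_backend/main.py.
import chromadb

CONTENT_WEIGHT = 0.6   # default: 60 percent content
METADATA_WEIGHT = 0.4  # default: 40 percent metadata

client = chromadb.Client()
metadata_col = client.get_or_create_collection("document_metadata")
content_col = client.get_or_create_collection("document_content")

def weighted_scores(query: str, n_results: int = 10) -> dict[str, float]:
    """Combine similarities from both collections into one score per document."""
    combined: dict[str, float] = {}
    for collection, weight in ((content_col, CONTENT_WEIGHT), (metadata_col, METADATA_WEIGHT)):
        hits = collection.query(query_texts=[query], n_results=n_results)
        best: dict[str, float] = {}
        for hit_id, distance in zip(hits["ids"][0], hits["distances"][0]):
            similarity = 1.0 - distance          # assumes cosine distance
            doc_id = hit_id.split("::")[0]       # content ids assumed as "doc::chunkN"
            best[doc_id] = max(best.get(doc_id, 0.0), similarity)
        for doc_id, similarity in best.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * similarity
    return combined
```

Taking the best-matching chunk per document before applying the weights keeps a long file from outranking others simply because it produced many chunks.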
Changes
Python Backend (python_backend/main.py)
Added StoreWithContentRequest model with optional extracted_content (sketched after this list of changes)
Introduced separate ChromaDB collections for metadata and content
Implemented ContentExtractor for spreadsheet and CSV parsing
Enhanced /store endpoint to chunk and index extracted content
Added /search_enhanced endpoint with weighted semantic ranking
Documented /search_enhanced in the root endpoint response
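For orientation, here is a rough sketch of the request model and the enhanced /store flow described above. Only StoreWithContentRequest, the optional extracted_content field, the collection names, and the roughly 500-character chunking come from this PR; the other field names, the chunk-id scheme, and the response shape are assumptions.

```python
# Rough sketch of the new request model and enhanced /store flow; field names
# other than extracted_content, and the chunk-id scheme, are assumptions.
from typing import Optional

import chromadb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = chromadb.Client()
metadata_col = client.get_or_create_collection("document_metadata")
content_col = client.get_or_create_collection("document_content")

class StoreWithContentRequest(BaseModel):
    document_id: str                          # illustrative field name
    metadata: str                             # uploader-provided description
    extracted_content: Optional[str] = None   # new optional field from this PR

def chunk_text(text: str, size: int = 500) -> list[str]:
    # ~500-character segments, per the Technical Details section.
    return [text[i:i + size] for i in range(0, len(text), size)]

@app.post("/store")
def store(request: StoreWithContentRequest):
    # Metadata is always indexed; content is indexed only when extraction succeeded.
    metadata_col.add(ids=[request.document_id], documents=[request.metadata])
    chunks = chunk_text(request.extracted_content) if request.extracted_content else []
    if chunks:
        content_col.add(
            ids=[f"{request.document_id}::chunk{i}" for i in range(len(chunks))],
            documents=chunks,
        )
    return {"status": "stored", "chunks_indexed": len(chunks)}
```

Embedding the parent document id in each chunk id makes the later deduplication in /search_enhanced straightforward.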
JavaScript Backend
Added contentExtractorController.js for spreadsheet and CSV parsing
Updated anonymizeController.js to return extractedContent in API responses
Content extraction runs after anonymization and skips PHI and anonymized identifiers
Tests
test_store_with_content validates enhanced storage with extracted content
test_search_enhanced validates combined retrieval with weighted scoring (an example sketch follows this list)
test_store_backward_compatibility ensures existing /store requests continue to work unchanged
Added a JavaScript test validating extractedContent in /api/anonymize responses
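For reference, a hedged sketch of what an enhanced-search test could look like against the backend running on port 3002; the payload fields and response shape are assumptions and may differ from the actual tests in python_backend/tests.

```python
# Hedged sketch of a test like test_search_enhanced; the real tests may use
# different payload fields and assertions.
import requests

BASE_URL = "http://localhost:3002"

def test_search_enhanced_returns_document_level_results():
    payload = {"query": "patient readmission rates", "n_results": 5}  # field names assumed
    response = requests.post(f"{BASE_URL}/search_enhanced", json=payload)
    assert response.status_code == 200
    body = response.json()
    # Results are expected per parent document, not per content chunk.
    assert isinstance(body.get("results"), list)
```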
Technical Details
Content chunking uses approximately 500-character segments for improved semantic granularity
/search_enhanced deduplicates results by grouping content chunks under their parent document (see the sketch below)
Fully backward compatible: existing endpoints and request formats are unchanged
Content extraction failures do not block uploads and default to empty content
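To illustrate the deduplication step mentioned above, here is a small sketch that groups chunk-level hits under their parent document. The "doc-id::chunkN" id scheme and the max-similarity aggregation are assumptions.

```python
# Illustrative helper: collapse chunk-level hits back to their parent documents
# so a document with many matching chunks appears once in /search_enhanced
# results. The "doc-id::chunkN" id scheme is an assumption.
from collections import defaultdict

def dedupe_by_parent(chunk_hits: list[tuple[str, float]]) -> dict[str, float]:
    """chunk_hits: (chunk_id, similarity) pairs, e.g. ('doc-42::chunk3', 0.81)."""
    best: dict[str, float] = defaultdict(float)
    for chunk_id, similarity in chunk_hits:
        parent_id = chunk_id.split("::")[0]
        best[parent_id] = max(best[parent_id], similarity)
    return dict(best)
```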
Testing
Python tests (requires the backend running on port 3002):
cd python_backend
python -m pytest tests -v
JavaScript tests (requires the backend running on port 3001):
cd javascript_backend
npm test
Issue Reference
Fixes #42
This PR addresses the approach proposed in issue #42 by incorporating extracted file content into semantic retrieval alongside metadata-based search.