
@SIDDHANTCOOKIE

Sorry for the confusion earlier. I’ve closed the PR against main and opened a new one targeting the dev branch, as requested. Thanks for the clarification, @pradeeban.

Problem

The current search pipeline relies exclusively on uploader-provided metadata. These metadata fields are vectorized and used for similarity search against data buyers' queries.

Because search is limited to metadata alone, results do not always reflect the actual contents of uploaded files. This reduces relevance, robustness, and the overall quality of semantic discovery for data buyers.

Solution

This PR introduces content-aware indexing and retrieval by extracting and indexing actual file content in addition to uploader metadata.

Key improvements include:

Content Extraction

Data is extracted directly from uploaded files, such as Excel and CSV files, during the anonymization stage.
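For illustration, here is a minimal Python sketch of the extraction idea, assuming pandas is available; the function name and flattening format are illustrative, not the PR's exact implementation:

```python
import pandas as pd

def extract_file_content(path: str) -> str:
    """Flatten a CSV or Excel file's cells into plain, indexable text."""
    if path.endswith(".csv"):
        df = pd.read_csv(path)
    elif path.endswith((".xlsx", ".xls")):
        df = pd.read_excel(path)  # requires openpyxl (or xlrd) installed
    else:
        return ""  # unsupported types fall back to empty content
    # Join headers and cell values into a single searchable string.
    parts = [" ".join(map(str, df.columns))]
    parts += [" ".join(map(str, row)) for row in df.itertuples(index=False)]
    return "\n".join(parts)
```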

Dual Vector Storage

Metadata and extracted content are stored in separate ChromaDB collections:

document_metadata

document_content
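A minimal sketch of the dual-collection setup, assuming ChromaDB's default in-memory client; the collection names come from this PR, while the document IDs and payloads are illustrative:

```python
import chromadb

client = chromadb.Client()

# One collection per signal, so each can be queried and weighted on its own.
metadata_collection = client.get_or_create_collection("document_metadata")
content_collection = client.get_or_create_collection("document_content")

# The uploader summary goes into the metadata collection...
metadata_collection.add(ids=["doc-1"], documents=["Hospital vitals spreadsheet, 2023"])
# ...while extracted content is chunked, tagged with its parent document,
# and stored separately.
content_collection.add(
    ids=["doc-1-chunk-0"],
    documents=["heart_rate 72 resp_rate 16 ..."],
    metadatas=[{"parent_id": "doc-1"}],
)
```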

Weighted Search

A new /search_enhanced endpoint combines similarity scores from both collections using configurable weights.
The default weighting is 60 percent content and 40 percent metadata.
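A sketch of the weighted-score idea; the helper name and the distance-to-similarity conversion are assumptions, and only the 0.6/0.4 default weights come from this PR:

```python
def combined_score(
    content_distance: float,
    metadata_distance: float,
    content_weight: float = 0.6,
    metadata_weight: float = 0.4,
) -> float:
    """Blend results from both collections into one ranking score."""
    # ChromaDB queries return distances (lower is better), so convert
    # each to a similarity before weighting.
    content_sim = 1.0 / (1.0 + content_distance)
    metadata_sim = 1.0 / (1.0 + metadata_distance)
    return content_weight * content_sim + metadata_weight * metadata_sim
```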

Improved Relevance

Search results are now ranked using real file content rather than relying only on uploader summaries.

Changes

Python Backend (python_backend/main.py)

Added StoreWithContentRequest model with optional extracted_content (see the sketch after this list)

Introduced separate ChromaDB collections for metadata and content

Implemented ContentExtractor for spreadsheet and CSV parsing

Enhanced /store endpoint to chunk and index extracted content

Added /search_enhanced endpoint with weighted semantic ranking

Documented /search_enhanced in the root endpoint response
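As referenced above, a sketch of what the request model might look like, assuming Pydantic (as is typical for a FastAPI backend); every field other than extracted_content is a guess:

```python
from typing import Optional
from pydantic import BaseModel

class StoreWithContentRequest(BaseModel):
    document_id: str                          # assumed field
    metadata: str                             # uploader-provided summary (assumed)
    extracted_content: Optional[str] = None   # omitted by legacy clients
```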

JavaScript Backend

Added contentExtractorController.js for spreadsheet and CSV parsing

Updated anonymizeController.js to return extractedContent in API responses

Content extraction runs post-anonymization and skips PHI and anonymized identifiers
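A Python sketch of the skip-anonymized-identifiers idea (the actual implementation lives in contentExtractorController.js); the placeholder pattern is an assumption, since the anonymized-token format isn't shown in this PR:

```python
import re

# Assumed placeholder format for anonymized values; adjust to the real one.
ANONYMIZED_TOKEN = re.compile(r"\[(?:REDACTED|PHI|ANON)[^\]]*\]")

def strip_anonymized(text: str) -> str:
    """Drop anonymized placeholders so they are never indexed for search."""
    return ANONYMIZED_TOKEN.sub("", text)
```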

Tests

test_store_with_content validates enhanced storage with extracted content

test_search_enhanced validates combined retrieval with weighted scoring

test_store_backward_compatibility ensures existing /store requests continue to work unchanged (sketched after this list)

A JavaScript test validates extractedContent in /api/anonymize responses
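As referenced above, a minimal pytest-style sketch of the backward-compatibility check, assuming the backend runs on localhost:3002 (per the Testing section below); the payload fields are illustrative, not the exact request schema:

```python
import requests

def test_store_backward_compatibility():
    # A legacy request with no extracted_content must still succeed unchanged.
    payload = {"document_id": "doc-legacy", "metadata": "uploader summary only"}
    resp = requests.post("http://localhost:3002/store", json=payload)
    assert resp.status_code == 200
```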

Technical Details

Content chunking uses approximately 500-character segments for improved semantic granularity (see the sketch after this list)

/search_enhanced deduplicates results by grouping content chunks under their parent document

Fully backward compatible, with no changes to existing APIs or request formats

Content extraction failures do not block uploads and default to empty content
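A sketch of the chunking and parent-document deduplication described above; the ~500-character chunk size comes from this PR, while the grouping logic and data shapes are illustrative:

```python
def chunk_content(text: str, size: int = 500) -> list[str]:
    """Split extracted content into roughly 500-character segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def dedupe_by_parent(hits: list[dict]) -> dict[str, float]:
    """Keep only the best-scoring chunk per parent document."""
    best: dict[str, float] = {}
    for hit in hits:  # each hit assumed to look like {"parent_id": ..., "score": ...}
        best[hit["parent_id"]] = max(best.get(hit["parent_id"], 0.0), hit["score"])
    return best
```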

Testing

Python tests (requires the backend running on port 3002):

```bash
cd python_backend
python -m pytest tests -v
```

JavaScript tests (requires the backend running on port 3001):

```bash
cd javascript_backend
npm test
```

Issue Reference

Fixes #42
This PR implements the approach proposed in issue #42, incorporating extracted file content into semantic retrieval alongside metadata-based search.

@SIDDHANTCOOKIE (Author)

Hey @pradeeban @karthiksathishjeemain, a quick follow-up on the PR. Would love your review whenever it’s convenient.

@pradeeban (Contributor)

@karthiksathishjeemain Does this PR satisfy issue #42?

@karthiksathishjeemain (Contributor)

@pradeeban, the approach in this PR largely resolves the issue. I haven't checked the code yet due to my current engagements. It is safer to merge it into the dev branch. Also, @SIDDHANTCOOKIE, there appears to be a merge conflict; please resolve it.

@SIDDHANTCOOKIE (Author)

@karthiksathishjeemain I’ve resolved the merge conflicts. They were caused by earlier PRs being merged. Everything is fixed, and it’s safe to merge now. Thank you.

@karthiksathishjeemain (Contributor)

Good. @pradeeban, please merge this.
