Implemented issue #42 #105
Sorry for the confusion earlier. I've closed the PR against main and opened a new one targeting the dev branch, as advised. Thanks for the clarification @pradeeban
Problem
The current search pipeline relies exclusively on uploader-provided metadata. These metadata fields are vectorized and used for similarity search against data buyer queries.
Because search is limited to metadata alone, results do not always reflect the actual contents of uploaded files. This reduces relevance, robustness, and the overall quality of semantic discovery for data buyers.
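To make the limitation concrete, here is a minimal sketch of a metadata-only flow with ChromaDB. The collection name matches the one introduced below, but the document text and query are purely illustrative and are not code from python_backend/main.py.

```python
# Minimal sketch of the current metadata-only flow; identifiers and sample
# data are illustrative, not the actual code in python_backend/main.py.
import chromadb

client = chromadb.Client()
metadata_collection = client.get_or_create_collection("document_metadata")

# Only the uploader-provided description is embedded and indexed.
metadata_collection.add(
    ids=["doc-1"],
    documents=["De-identified hospital readmission rates, 2019-2023"],
)

# A buyer query is matched against metadata text only; the file's actual
# contents never influence the ranking.
results = metadata_collection.query(query_texts=["ICU length of stay"], n_results=1)
```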
Solution
This PR introduces content aware indexing and retrieval by extracting and indexing actual file content in addition to uploader metadata.
Key improvements include:
Content Extraction
Data is extracted directly from uploaded files such as Excel and CSV during the anonymization stage.
Dual Vector Storage
Metadata and extracted content are stored in separate ChromaDB collections:
document_metadata
document_content
Weighted Search
A new /search_enhanced endpoint combines similarity scores from both collections using configurable weights.
The default weighting is 60 percent content and 40 percent metadata (a sketch of the combination follows this list).
Improved Relevance
Search results are now ranked using real file content rather than relying only on uploader summaries.
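As referenced above, the following is a minimal sketch of how the weighted combination could work, assuming ChromaDB returns cosine distances that are converted to similarities. The function name, the distance-to-similarity conversion, and the score aggregation are assumptions rather than the exact code behind /search_enhanced.

```python
# Illustrative sketch of weighted ranking across the two collections; names
# and scoring details are assumptions, not the code in python_backend/main.py.
import chromadb

CONTENT_WEIGHT = 0.6   # default: 60 percent content
METADATA_WEIGHT = 0.4  # default: 40 percent metadata

client = chromadb.Client()
metadata_col = client.get_or_create_collection("document_metadata")
content_col = client.get_or_create_collection("document_content")

def weighted_scores(query: str, n_results: int = 10) -> dict[str, float]:
    """Combine similarities from both collections into one score per document."""
    combined: dict[str, float] = {}
    for collection, weight in ((content_col, CONTENT_WEIGHT), (metadata_col, METADATA_WEIGHT)):
        hits = collection.query(query_texts=[query], n_results=n_results)
        best: dict[str, float] = {}
        for hit_id, distance in zip(hits["ids"][0], hits["distances"][0]):
            similarity = 1.0 - distance          # assumes cosine distance
            doc_id = hit_id.split("::")[0]       # content ids assumed as "doc::chunkN"
            best[doc_id] = max(best.get(doc_id, 0.0), similarity)
        for doc_id, similarity in best.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * similarity
    return combined
```

Taking the best-matching chunk per document before applying the weights keeps a long file from outranking others simply because it produced many chunks.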
Changes
Python Backend (python_backend/main.py)
Added StoreWithContentRequest model with optional extracted_content (sketched after this list of changes)
Introduced separate ChromaDB collections for metadata and content
Implemented ContentExtractor for spreadsheet and CSV parsing
Enhanced /store endpoint to chunk and index extracted content
Added /search_enhanced endpoint with weighted semantic ranking
Documented /search_enhanced in the root endpoint response
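For orientation, here is a rough sketch of the request model and the enhanced /store flow described above. Only StoreWithContentRequest, the optional extracted_content field, the collection names, and the roughly 500-character chunking come from this PR; the other field names, the chunk-id scheme, and the response shape are assumptions.

```python
# Rough sketch of the new request model and enhanced /store flow; field names
# other than extracted_content, and the chunk-id scheme, are assumptions.
from typing import Optional

import chromadb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = chromadb.Client()
metadata_col = client.get_or_create_collection("document_metadata")
content_col = client.get_or_create_collection("document_content")

class StoreWithContentRequest(BaseModel):
    document_id: str                          # illustrative field name
    metadata: str                             # uploader-provided description
    extracted_content: Optional[str] = None   # new optional field from this PR

def chunk_text(text: str, size: int = 500) -> list[str]:
    # ~500-character segments, per the Technical Details section.
    return [text[i:i + size] for i in range(0, len(text), size)]

@app.post("/store")
def store(request: StoreWithContentRequest):
    # Metadata is always indexed; content is indexed only when extraction succeeded.
    metadata_col.add(ids=[request.document_id], documents=[request.metadata])
    chunks = chunk_text(request.extracted_content) if request.extracted_content else []
    if chunks:
        content_col.add(
            ids=[f"{request.document_id}::chunk{i}" for i in range(len(chunks))],
            documents=chunks,
        )
    return {"status": "stored", "chunks_indexed": len(chunks)}
```

Embedding the parent document id in each chunk id makes the later deduplication in /search_enhanced straightforward.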
JavaScript Backend
Added contentExtractorController.js for spreadsheet and CSV parsing
Updated anonymizeController.js to return extractedContent in API responses
Content extraction runs after anonymization and skips PHI and anonymized identifiers
Tests
test_store_with_content validates enhanced storage with extracted content
test_search_enhanced validates combined retrieval with weighted scoring (an example sketch follows this list)
test_store_backward_compatibility ensures existing /store requests continue to work unchanged
Added a JavaScript test validating extractedContent in /api/anonymize responses
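For reference, a hedged sketch of what an enhanced-search test could look like against the backend running on port 3002; the payload fields and response shape are assumptions and may differ from the actual tests in python_backend/tests.

```python
# Hedged sketch of a test like test_search_enhanced; the real tests may use
# different payload fields and assertions.
import requests

BASE_URL = "http://localhost:3002"

def test_search_enhanced_returns_document_level_results():
    payload = {"query": "patient readmission rates", "n_results": 5}  # field names assumed
    response = requests.post(f"{BASE_URL}/search_enhanced", json=payload)
    assert response.status_code == 200
    body = response.json()
    # Results are expected per parent document, not per content chunk.
    assert isinstance(body.get("results"), list)
```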
Technical Details
Content chunking uses approximately 500-character segments for improved semantic granularity
/search_enhanced deduplicates results by grouping content chunks under their parent document (see the sketch below)
Fully backward compatible: existing endpoints and request formats are unchanged
Content extraction failures do not block uploads and default to empty content
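To illustrate the deduplication step mentioned above, here is a small sketch that groups chunk-level hits under their parent document. The "doc-id::chunkN" id scheme and the max-similarity aggregation are assumptions.

```python
# Illustrative helper: collapse chunk-level hits back to their parent documents
# so a document with many matching chunks appears once in /search_enhanced
# results. The "doc-id::chunkN" id scheme is an assumption.
from collections import defaultdict

def dedupe_by_parent(chunk_hits: list[tuple[str, float]]) -> dict[str, float]:
    """chunk_hits: (chunk_id, similarity) pairs, e.g. ('doc-42::chunk3', 0.81)."""
    best: dict[str, float] = defaultdict(float)
    for chunk_id, similarity in chunk_hits:
        parent_id = chunk_id.split("::")[0]
        best[parent_id] = max(best[parent_id], similarity)
    return dict(best)
```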
Testing
Python tests (requires the backend running on port 3002):
cd python_backend
python -m pytest tests -v
JavaScript tests (requires the backend running on port 3001):
cd javascript_backend
npm test
Issue Reference
Fixes #42
This PR addresses the approach proposed in issue #42 by incorporating extracted file content into semantic retrieval alongside metadata-based search.