This repository contains my submission for the Mercor Search Engineer Take-Home Challenge. The task was to build a candidate search system over a dataset of 200k LinkedIn profiles using embeddings, filtering logic, and evaluation criteria.
The system is built using:
- VoyageAI for generating embeddings
- FAISS for fast approximate nearest-neighbor search
- MongoDB for storing candidate data
- Custom logic for filtering by hard and soft criteria
The goal was to build a system that, given a job query (e.g., "Tax Lawyer"), returns the top 10 most relevant candidate profiles from the dataset. Each job query had a corresponding YAML file containing:
- Hard Criteria (must be satisfied)
- Soft Criteria (preferred features, ranked by relevance)
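For reference, a criteria file for the "Tax Lawyer" query might look like the following (field names and values are hypothetical; the actual schema is defined by the challenge files):

```yaml
# Illustrative criteria file -- structure and wording are assumptions
hard_criteria:
  - "JD degree"
  - "Licensed to practice law"
soft_criteria:
  - "Experience advising on corporate tax matters"
  - "Prior work at a large law firm"
```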
## Embeddings & Indexing
I used the `voyage-3` model from VoyageAI to generate embeddings for each job query and each candidate (pre-generated and stored in the MongoDB `embedding` field). A FAISS `HNSWFlat` index was built over all candidates for fast search. The full dataset of 200k+ rows was embedded and indexed.
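The role the FAISS index plays can be illustrated with a small brute-force stand-in: exact cosine-similarity search over normalized embeddings, which `HNSWFlat` approximates efficiently at 200k+ scale. The class and its toy data below are illustrative, not the actual implementation.

```python
import numpy as np

class BruteForceIndex:
    """Exact nearest-neighbor search over L2-normalized embeddings.
    A FAISS HNSWFlat index exposes the same add/search pattern but
    scales to hundreds of thousands of vectors."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, embeddings: np.ndarray) -> None:
        # Normalize rows so the inner product equals cosine similarity.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, embeddings / norms])

    def search(self, query: np.ndarray, k: int):
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q        # cosine similarity per candidate
        top = np.argsort(-scores)[:k]    # indices of the k best matches
        return scores[top], top

# Toy usage: three 4-d "candidate embeddings" and one query vector.
index = BruteForceIndex(dim=4)
index.add(np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0]], dtype=np.float32))
scores, ids = index.search(np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32), k=2)
print(list(ids))  # [0, 2] -- candidate 0 is an exact match, candidate 2 is close
```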
## Retrieval Workflow
For each job title:
- The top `k` candidates (default: all candidates) are retrieved from FAISS using embedding similarity.
- From these, only candidates passing all hard criteria (exact string match) are retained.
- The remaining candidates are scored on soft criteria using string similarity (`difflib.SequenceMatcher`) applied to their `rerankSummary`.
- The top 10 candidates by soft score are selected.
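The filtering and scoring steps above can be sketched as follows. The function names, candidate dicts, and criteria strings are illustrative; only the use of `difflib.SequenceMatcher` over `rerankSummary` reflects the actual pipeline.

```python
from difflib import SequenceMatcher

def passes_hard(summary: str, hard_criteria: list[str]) -> bool:
    # Hard filter: every hard criterion must appear verbatim
    # (case-insensitive) in the candidate text.
    return all(c.lower() in summary.lower() for c in hard_criteria)

def soft_score(summary: str, soft_criteria: list[str]) -> float:
    # Average SequenceMatcher ratio between the rerankSummary and each
    # soft criterion, scaled to a 0-100 score.
    if not soft_criteria:
        return 0.0
    ratios = [SequenceMatcher(None, c.lower(), summary.lower()).ratio()
              for c in soft_criteria]
    return 100.0 * sum(ratios) / len(ratios)

# Toy candidates with the fields the pipeline relies on.
candidates = [
    {"_id": "a1", "rerankSummary": "Tax attorney with IRS litigation experience"},
    {"_id": "b2", "rerankSummary": "Mechanical engineer, CAD and HVAC design"},
]
hard = ["attorney"]
soft = ["experience in tax law"]

survivors = [c for c in candidates if passes_hard(c["rerankSummary"], hard)]
top10 = sorted(survivors,
               key=lambda c: soft_score(c["rerankSummary"], soft),
               reverse=True)[:10]
print(top10[0]["_id"])  # a1
```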
## Evaluation
The system outputs the top 10 `_id`s per query to files like `results/tax_lawyer_ids.txt`. These are sent to Mercor's evaluation API via `evaluate.py` to compute precision based on both hard and soft criteria.
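Writing a result file can be sketched as below. The helper name and ID values are hypothetical; only the `results/<query>_ids.txt` naming convention comes from the pipeline described above.

```python
import os
import tempfile

def write_results(query: str, ids: list[str], out_dir: str = "results") -> str:
    """Write one MongoDB ObjectID per line to <out_dir>/<query>_ids.txt.
    The helper itself is illustrative, not the actual evaluate.py code."""
    os.makedirs(out_dir, exist_ok=True)
    slug = query.lower().replace(" ", "_")
    path = os.path.join(out_dir, f"{slug}_ids.txt")
    with open(path, "w") as f:
        f.write("\n".join(ids))
    return path

# Demo run in a temporary directory with fake IDs.
out_dir = tempfile.mkdtemp()
path = write_results("Tax Lawyer", ["id001", "id002"], out_dir=out_dir)
print(os.path.basename(path))  # tax_lawyer_ids.txt
```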
Although TurboPuffer or similar LLM-based reranking was recommended in the baseline, I chose not to use it for the following reasons:
- Despite multiple attempts, DNS resolution errors prevented successful API calls to the reranking endpoint (even after fixes such as switching DNS to 8.8.8.8).
- To keep the system reliable and functioning end-to-end, I opted for a lightweight local reranking method using string similarity, which still aligns well with the goal of evaluating candidate summaries against job criteria.
This kept the reranking step fully deterministic, free of external API dependencies, and robust under load.
The final result files for each public query are saved under the results/ folder. Each file contains up to 10 candidate MongoDB ObjectIDs:
| Query | Candidates | Avg Final Score |
|---|---|---|
| Tax Lawyer | 10 | 31.00 |
| Junior Corporate Lawyer | 1 | 86.67 |
| Radiology | 10 | 41.33 |
| Doctors (MD) | 7 | 26.14 |
| Biology Expert | 10 | 23.83 |
| Anthropology | 0 | 0.00 |
| Mathematics PhD | 0 | 0.00 |
| Quantitative Finance | 10 | 7.00 |
| Bankers | 10 | 0.00 |
| Mechanical Engineers | 10 | 63.67 |