A candidate retrieval system using FAISS over AI-embedded resume data and hard and soft criteria filtering. Outputs results in eval-ready format for private scoring.

pt413/People_Search_System

Candidate Retrieval System

This repository contains my submission for the Mercor Search Engineer Take-Home Challenge. The task was to build a candidate search system over a dataset of 200k LinkedIn profiles using embeddings, filtering logic, and evaluation criteria.

The system is built using:

  • VoyageAI for generating embeddings
  • FAISS for fast approximate nearest-neighbor search
  • MongoDB for storing candidate data
  • Custom logic for filtering by hard and soft criteria

Problem Statement

The goal was to build a system that, given a job query (e.g., "Tax Lawyer"), returns the top 10 most relevant candidate profiles from the dataset. Each job query had a corresponding YAML file containing:

  • Hard Criteria (must be satisfied)
  • Soft Criteria (preferred features, ranked by relevance)
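Such a criteria file can be loaded with PyYAML. The key names (hard_criteria / soft_criteria) and the example values below are my assumptions for illustration; the actual file layout is not shown in this README:

```python
import yaml  # PyYAML

# Illustrative criteria file for a "Tax Lawyer"-style query; the real
# key names and values may differ from this assumed schema.
raw = """
hard_criteria:
  - JD degree from an accredited law school
  - Licensed to practice law
soft_criteria:
  - Experience with corporate tax matters
  - Background at a large law firm
"""

criteria = yaml.safe_load(raw)
hard = criteria["hard_criteria"]  # must all be satisfied
soft = criteria["soft_criteria"]  # preferred, used for ranking
```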

My Approach

  1. Embeddings & Indexing
    Query embeddings are generated with the voyage-3 model from VoyageAI; candidate embeddings were pre-generated with the same model and stored in each document's MongoDB embedding field. A FAISS HNSWFlat index was built over all 200k+ candidate vectors for fast approximate nearest-neighbor search.

  2. Retrieval Workflow
    For each job title:

    • The top k candidates (by default, the entire pool) are retrieved from FAISS, ranked by embedding similarity to the query.
    • From these, only those passing all hard criteria (exact string match) are retained.
    • The remaining candidates are scored on soft criteria using string similarity (difflib.SequenceMatcher) applied to their rerankSummary.
    • The top 10 candidates by soft score are selected.
  3. Evaluation
    The system outputs the top 10 _ids per query to files like results/tax_lawyer_ids.txt. These are sent to Mercor’s evaluation API via evaluate.py to compute precision based on both hard and soft criteria.
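The filter-and-rerank steps above can be sketched with the standard library alone. The candidate dicts are made up, the field name rerankSummary follows the description, and reading "exact string match" as substring containment is my assumption:

```python
from difflib import SequenceMatcher


def passes_hard_criteria(candidate, hard_criteria):
    # Every hard criterion must appear verbatim in the candidate's summary.
    text = candidate["rerankSummary"]
    return all(criterion in text for criterion in hard_criteria)


def soft_score(candidate, soft_criteria):
    # Average string similarity between each soft criterion and the summary.
    text = candidate["rerankSummary"]
    return sum(
        SequenceMatcher(None, c, text).ratio() for c in soft_criteria
    ) / len(soft_criteria)


def rank(candidates, hard_criteria, soft_criteria, top_n=10):
    # Hard filter first, then sort survivors by soft score, keep the top N.
    survivors = [c for c in candidates if passes_hard_criteria(c, hard_criteria)]
    survivors.sort(key=lambda c: soft_score(c, soft_criteria), reverse=True)
    return [c["_id"] for c in survivors[:top_n]]
```

A production version would also guard against empty soft-criteria lists and missing summary fields.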


Why I Didn’t Use LLM Re-Ranking

Although TurboPuffer or similar LLM-based reranking was recommended in the baseline, I chose not to use it for the following reasons:

  • Despite multiple attempts, DNS resolution errors prevented successful API calls to the reranking endpoint (even after trying fixes like changing DNS to 8.8.8.8).
  • To keep the system reliable and fully functioning end-to-end, I opted for a lightweight local reranking method using string similarity, which still aligns well with the goal of evaluating candidate summaries against job criteria.

This kept my system fully deterministic, free of external API dependencies at rerank time, and robust end-to-end.
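As an illustration of the determinism claim: difflib.SequenceMatcher yields a stable, locally computed similarity in [0, 1] with no network calls (the strings here are made up):

```python
from difflib import SequenceMatcher

query = "Tax Lawyer"
summary = "Senior tax lawyer, M&A and corporate tax"

# ratio() = 2 * matched_chars / total_chars; case-fold first so
# "Tax Lawyer" can match "tax lawyer" inside the summary.
score = SequenceMatcher(None, query.lower(), summary.lower()).ratio()
```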


Result Files

The final result files for each public query are saved under the results/ folder. Each file lists the selected candidates' MongoDB ObjectIDs (up to 10 per query).
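Reading and writing these files takes only a couple of helpers; the one-ObjectID-per-line layout is my assumption about the format:

```python
from pathlib import Path


def write_ids(path, ids):
    # One ObjectID per line, for files like results/tax_lawyer_ids.txt.
    Path(path).write_text("\n".join(str(i) for i in ids) + "\n")


def read_ids(path):
    # Returns the file's ids as a list of strings.
    return Path(path).read_text().split()
```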


Evaluation Scores

Query                     Candidates   Avg Final Score
Tax Lawyer                10           31.00
Junior Corporate Lawyer    1           86.67
Radiology                 10           41.33
Doctors (MD)               7           26.14
Biology Expert            10           23.83
Anthropology               0            0.00
Mathematics PhD            0            0.00
Quantitative Finance      10            7.00
Bankers                   10            0.00
Mechanical Engineers      10           63.67
