The high-performance .NET search engine based on pattern recognition
Infidex is a search engine based on pattern recognition. Learning from your data, Infidex automatically extracts features like frequency and rarity and embeds them into a multi-dimensional hypersphere for intelligent matching. This enables fuzzy querying with unparalleled typo tolerance, without any manual tuning. Zero dependencies, blazingly fast, built for developers who need search that just works.
- Blazingly Fast - Index thousands of documents per second, search in milliseconds
- Intelligent Matching - Finds what you're looking for even with typos and variations
- Per-Term Coverage - Ranks documents by how many query terms they match (more terms = higher rank)
- Rich Filtering - SQL-like query language (Infiscript) for complex filters
- Faceted Search - Build dynamic filters and aggregations
- Smart Ranking - Lexicographic (coverage, quality) scoring for principled relevance
- Multi-Field Search - Search across multiple fields with configurable weights
- Incremental Indexing - Add or update documents without rebuilding the entire index
- Fully Thread-Safe - Multiple concurrent readers, writers block readers and other writers
- Production Ready - Comprehensive test coverage, clean API, zero dependencies
- Easy Integration - Embeds directly into your .NET application
Install via NuGet:
dotnet add package Infidex

using Infidex;
using Infidex.Core;
// Create search engine
var engine = SearchEngine.CreateDefault();
// Index documents
var documents = new[]
{
new Document(1L, "The quick brown fox jumps over the lazy dog"),
new Document(2L, "A journey of a thousand miles begins with a single step"),
new Document(3L, "To be or not to be that is the question")
};
engine.IndexDocuments(documents);
// Search with typos - still finds matches!
var results = engine.Search("quik fox", maxResults: 10);
foreach (var result in results.Records)
{
Console.WriteLine($"Doc {result.DocumentId}: Score {result.Score}");
}

using Infidex.Api;
// Define fields with weights
var matrix = new DocumentFields();
matrix.AddField("title", "The Matrix", Weight.High);
matrix.AddField("description", "A computer hacker learns about the true nature of reality", Weight.Low);
var inception = new DocumentFields();
inception.AddField("title", "Inception", Weight.High);
inception.AddField("description", "A thief who steals corporate secrets through dream-sharing", Weight.Low);
var movies = new[]
{
new Document(1L, matrix),
new Document(2L, inception)
};
engine.IndexDocuments(movies);

Infiscript is a simple filtering language for writing intuitive filters that compile to optimized bytecode:
using Infidex.Api;
// Simple comparison
var filter = Filter.Parse("genre = 'Sci-Fi'");
// Boolean logic
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
// Complex expressions with grouping
var filter = Filter.Parse("(genre = 'Fantasy' AND year >= 2000) OR (genre = 'Horror' AND year >= 1980)");
// String operations
var filter = Filter.Parse("title CONTAINS 'matrix'");
var filter = Filter.Parse("title STARTS WITH 'The'");
var filter = Filter.Parse("description LIKE '%dream%'");
// Range checks
var filter = Filter.Parse("year BETWEEN 2000 AND 2020");
var filter = Filter.Parse("rating >= 8.0");
// List membership
var filter = Filter.Parse("genre IN ('Sci-Fi', 'Fantasy', 'Adventure')");
// Null checks
var filter = Filter.Parse("director IS NOT NULL");
// Regex matching
var filter = Filter.Parse("email MATCHES '^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$'");
// Ternary expressions (conditional logic)
var filter = Filter.Parse("age >= 18 ? 'adult' : 'minor'");
var filter = Filter.Parse("score >= 90 ? 'A' : score >= 80 ? 'B' : 'C'");
// Use filters in queries
var query = new Query("matrix", maxResults: 20)
{
Filter = Filter.Parse("year >= 2000 AND rating > 7.0")
};
var results = engine.Search(query);

The full EBNF specification is available here.
Comparison: =, !=, <, <=, >, >=
Boolean: AND (or &&), OR (or ||), NOT (or !)
String: CONTAINS, STARTS WITH, ENDS WITH, LIKE (% wildcard)
Special: IN, BETWEEN, IS NULL, IS NOT NULL, MATCHES (regex)
Conditional: ? : (ternary operator)
All operators are case-insensitive. Use parentheses for grouping.
Filters compile to portable bytecode for performance and serialization:
// Compile once, use many times
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
var bytecode = filter.CompileToBytes();
// Save to disk
File.WriteAllBytes("filter.bin", bytecode);
// Load and use later
var loaded = Filter.FromBytecode(File.ReadAllBytes("filter.bin"));
var query = new Query("space") { CompiledFilterBytecode = bytecode };

Build dynamic filters and navigate your data:
var query = new Query("science fiction", maxResults: 50)
{
EnableFacets = true
};
var results = engine.Search(query);
// Get facet counts
if (results.Facets != null)
{
foreach (var (fieldName, values) in results.Facets)
{
Console.WriteLine($"\n{fieldName}:");
foreach (var (value, count) in values)
{
Console.WriteLine($" {value}: {count} documents");
}
}
}
// Output:
// genre:
// Sci-Fi: 15 documents
// Fantasy: 8 documents
// Action: 5 documents
//
// year:
// 2020: 10 documents
// 2019: 8 documents
// 2018: 10 documents

Increase relevance scores for specific documents:
// Boost recent movies
var recentBoost = new Boost(
Filter.Parse("year >= 2020"),
BoostStrength.Large // +20 to score
);
// Boost highly-rated content
var ratingBoost = new Boost(
Filter.Parse("rating >= 8.0"),
BoostStrength.Medium // +10 to score
);
var query = new Query("action movie", maxResults: 20)
{
EnableBoost = true,
Boosts = new[] { recentBoost, ratingBoost }
};
var results = engine.Search(query);

Boost strengths: Small (+5), Medium (+10), Large (+20), Extreme (+40)
Sort results by any field:
// Sort by year (descending)
var query = new Query("thriller", maxResults: 20)
{
SortBy = fields.GetField("year"),
SortAscending = false
};
// Sort by rating (ascending)
var query = new Query("comedy", maxResults: 20)
{
SortBy = fields.GetField("rating"),
SortAscending = true
};

Infidex uses a data-agnostic lexicographic ranking model that relies entirely on structural and positional properties of the match—no collection statistics (like IDF) influence the final ranking. This ensures explainable, intuitive results regardless of the corpus.
Stage 1: BM25+ Candidate Generation
- Tokenizes text into character n-grams (2-grams + 3-grams)
- Builds inverted index with document frequencies
- Uses BM25+ as the information retrieval backbone with parameters $k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$
- BM25+ scoring with L2-normalized term weights
- Ultra-fast with byte-quantized weights (4x memory savings)
- Produces top-K candidates for Stage 2 refinement
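The standard BM25+ term formula with the parameters quoted above can be sketched as follows. This is an illustrative standalone implementation, not Infidex's internal code; the class and parameter names are hypothetical:

```csharp
using System;

class Bm25PlusSketch
{
    // Standard BM25+ score contribution of one term in one document.
    // Defaults k1 = 1.2, b = 0.75, delta = 1.0 match the parameters above.
    public static double TermScore(
        double tf,          // term frequency in the document
        double docLen,      // document length (in terms)
        double avgDocLen,   // average document length in the collection
        double docFreq,     // number of documents containing the term
        double totalDocs,   // total number of documents
        double k1 = 1.2, double b = 0.75, double delta = 1.0)
    {
        double idf = Math.Log((totalDocs - docFreq + 0.5) / (docFreq + 0.5) + 1.0);
        double norm = tf + k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0) / norm + delta);
    }
}
```

The $\delta$ term is what distinguishes BM25+ from plain BM25: it lower-bounds the contribution of a matching term so that very long documents are not penalized into irrelevance.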
Stage 2: Lexical Coverage Analysis
- Applied to top-K candidates from Stage 1
- Tracks per-term coverage for each query word using 5 algorithms:
- Exact whole-word matching
- Fuzzy word matching (adaptive Damerau-Levenshtein)
- Joined/split word detection
- Prefix/suffix matching (prefixes weighted higher than suffixes)
- LCS (Longest Common Subsequence) fallback when no word-level match exists
- For each query term $q_i$, computes a per-term coverage score
- Derives coordination coverage across all $n$ query terms
- Extracts structural features: phrase runs, anchor token positions, lexical perfection
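The fuzzy word matching above is based on Damerau-Levenshtein distance. A minimal sketch of the restricted (optimal string alignment) variant, which allows adjacent transpositions, might look like this; it is illustrative only, not Infidex's implementation:

```csharp
using System;

class FuzzySketch
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance:
    // edits are insertion, deletion, substitution, and adjacent transposition.
    public static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(
                    d[i - 1, j] + 1,          // deletion
                    d[i, j - 1] + 1),         // insertion
                    d[i - 1, j - 1] + cost);  // substitution
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1); // transposition
            }
        }
        return d[a.Length, b.Length];
    }
}
```

A transposition such as "teh" → "the" costs one edit here, where plain Levenshtein would charge two, which is why transposition-aware distances handle common typos so well.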
Stage 3: Lexicographic Score Fusion
The final score is a lexicographic tuple encoded into a 16-bit ushort:

$$\text{score} = (\text{precedence} \ll 8) \mid \text{semantic}$$

where:
- Precedence (high byte, 8 bits): Discrete match-quality tiers
- Semantic (low byte, 8 bits): Continuous similarity within a tier
Documents are ranked by a strict precedence order—higher bits always dominate lower bits:
| Bit | Value | Condition | Rationale |
|---|---|---|---|
| 7 | 128 | All query terms found in document | Completeness is paramount |
| 6 | 64 | All terms fully matched (whole word OR exact prefix) | Clean matches beat partial |
| 5 | 32 | Single-term: Strict whole-word match. Multi-term: Lexically perfect doc (every doc token explained by query) | Exact intent signal |
| 4 | 16 | Single-term: Perfect doc. Multi-term: Strict whole-word match | Secondary exactness |
| 3 | 8 | First match at position 0 (title starts with query) | Positional primacy |
| 2 | 4 | Precise prefix match (query is prefix of token) | Partial but anchored |
The semantic component provides smooth differentiation within precedence tiers:
For single-term queries, the semantic score is derived from:
- $L_{\text{substr}}$: Substring containment score (position-weighted)
- $L_{\text{prefix}}$: Longest prefix of a doc token matching a suffix of the query
- $L_{\text{fuzzy}}$: Damerau-Levenshtein similarity (transpositions allowed)
- $L_{\text{2seg}}$: Two-segment alignment for concatenated queries
For multi-term queries:

$$S = \alpha\, C_{\text{avg}} + \beta\, T_{\text{tfidf}} + \gamma\, R_{\text{phrase}}$$

where:
- $C_{\text{avg}}$ = average per-term coverage
- $T_{\text{tfidf}}$ = normalized TF-IDF score from Stage 1
- $R_{\text{phrase}}$ = phrase run bonus (consecutive query terms in document order)
- $\alpha + \beta + \gamma = 1$ (currently $\alpha = 0.4$, $\beta = 0.4$, $\gamma = 0.2$)
For queries that appear to be concatenated words, Infidex detects two-segment alignment:
Given query $q$:
- Prefix fragment: $q[0:\ell]$ where $\ell = \min(3, |q|/2)$
- Suffix fragment: $q[|q|-\ell:|q|]$

If distinct document tokens $t_i$ and $t_j$ exist such that:
- $t_i$ starts with or is a prefix of the prefix fragment
- $t_j$ ends with or is a suffix of the suffix fragment

then the two-segment alignment score $L_{\text{2seg}}$ is credited.
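The fragment rule above can be sketched directly. The helper name is hypothetical, not part of the Infidex API:

```csharp
using System;

class TwoSegmentSketch
{
    // Split a query into prefix/suffix fragments of length l = min(3, |q|/2),
    // as described above. Integer division mirrors |q|/2 for short queries.
    public static (string Prefix, string Suffix) Fragments(string q)
    {
        int l = Math.Min(3, q.Length / 2);
        return (q.Substring(0, l), q.Substring(q.Length - l));
    }
}
```

For a concatenated query like "quickfox", $\ell = \min(3, 4) = 3$, giving the fragments "qui" and "fox", which can then be matched against distinct document tokens.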
For multi-term queries, the first query token acts as an anchor when:
- It has length ≥ 3 characters
- It appears as a substring in the document text
- There is a phrase run of ≥ 2 consecutive query terms
This is a pure positional/string rule—no rarity or IDF is used.
The 16-bit score is computed as:

$$\text{score} = (\text{precedence} \ll 8) \mid \text{semantic}$$

where the precedence byte occupies the high 8 bits and the semantic byte the low 8 bits.
This ensures:
- Completeness dominates (all terms found beats partial matches)
- Match quality refines (whole words beat prefixes beat fuzzy)
- Similarity differentiates (smooth ordering within tiers)
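The packing and its lexicographic guarantee can be sketched as follows (illustrative, not Infidex's internal code):

```csharp
using System;

class ScoreFusionSketch
{
    // Pack the discrete precedence byte (high bits) and the continuous
    // semantic byte (low bits) into one 16-bit score. Any gain in a
    // precedence bit outweighs the entire semantic range, which is exactly
    // the strict lexicographic ordering described above.
    public static ushort Fuse(byte precedence, byte semantic)
        => (ushort)((precedence << 8) | semantic);
}
```

For example, a document with the completeness bit set (128) but a minimal semantic byte still outranks one at a lower precedence tier with a maxed-out semantic byte: `Fuse(128, 0)` is 32768, which exceeds `Fuse(64, 255)` at 16639.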
Infidex supports efficient binary serialization for persisting and loading search indexes:
var engine = SearchEngine.CreateDefault();
// Index your documents
engine.IndexDocuments(documents);
// Save to disk (binary format)
engine.Save("my-index.bin");

// Load from disk with the same configuration
var engine = SearchEngine.Load(
filePath: "my-index.bin",
indexSizes: new[] { 3 },
startPadSize: 2,
stopPadSize: 0,
enableCoverage: true
);
// Search immediately (no re-indexing needed)
var results = engine.Search("query text", maxResults: 10);

// Save asynchronously
await engine.SaveAsync("my-index.bin");
// Load asynchronously
var engine = await SearchEngine.LoadAsync(
filePath: "my-index.bin",
indexSizes: new[] { 3 },
startPadSize: 2,
stopPadSize: 0
);

The binary format includes:
- All indexed documents (text, metadata, fields)
- Complete inverted index (terms and postings)
- TF-IDF weights (pre-normalized, byte-quantized)
- Document frequencies and statistics
Index Size: Typically much smaller than source data due to byte-quantized weights and compressed postings. Example: 40k movie titles = < 5 MB index. See this test.
- `SearchEngine.Search` is thread-safe and can be called from many threads concurrently.
- Indexing operations (`IndexDocuments`, `IndexDocumentsAsync`, `IndexDocument`, `CalculateWeights`, `Save`, `Load`) acquire an exclusive writer lock and block other operations while they run.
- Filters and boosts are compiled once and cached in a thread-safe dictionary; execution uses thread-local virtual machines to avoid shared state.
- Share a single `SearchEngine` instance per application/service and let multiple threads use it concurrently.
var engine = SearchEngine.CreateDefault();
// - Multi-size n-grams: [2, 3]
// - Coverage: Enabled (multi-algorithm model: exact, LD1 fuzzy, prefix/suffix, join/split, LCS)
// - Balanced for speed and accuracy

var engine = SearchEngine.CreateMinimal();
// - Single n-gram: [3]
// - Coverage: Disabled
// - Faster but less accurate

var engine = new SearchEngine(
indexSizes: new[] { 2, 3 }, // Character n-gram sizes
startPadSize: 2, // Pad start of words
stopPadSize: 0, // Pad end of words
enableCoverage: true, // Enable lexical matching
textNormalizer: TextNormalizer.CreateDefault(),
tokenizerSetup: TokenizerSetup.CreateDefault(),
coverageSetup: CoverageSetup.CreateDefault(),
stopTermLimit: 1_250_000, // Max terms before stop-word filtering
wordMatcherSetup: new WordMatcherSetup
{
SupportLD1 = true, // Enable fuzzy matching
SupportAffix = true, // Enable prefix/suffix
MinimumWordSizeExact = 2,
MaximumWordSizeExact = 50,
MinimumWordSizeLD1 = 4,
MaximumWordSizeLD1 = 20
},
fieldWeights: new[] { 1.0f, 2.0f } // Per-field weights
);

Comprehensive test suite with 300+ tests:
dotnet test

SearchEngine - Main search engine
Document - Document with ID and fields
Query - Search query configuration
Filter - Base class for filters
DocumentFields - Field schema definition
Boost - Boost configuration
// Indexing
engine.IndexDocuments(IEnumerable<Document> docs)
await engine.IndexDocumentsAsync(IEnumerable<Document> docs)
// Searching
SearchResult Search(string text, int maxResults = 10)
SearchResult Search(Query query)
// Document management
Document? GetDocument(long documentKey)
List<Document> GetDocuments(long documentKey) // All segments
// Schema
engine.DocumentFieldSchema = fields;

Check out Infidex.Example for complete working examples:
- MovieExample.cs - Multi-field movie search with faceting
- FilterExample.cs - Advanced Infiscript filtering
- BoostExample.cs - Document boosting strategies
This library is licensed under the MIT license. 💜