1 parent 1465a96, commit 1737f00
Showing 3 changed files with 51 additions and 39 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
layout: single | ||
title: Internals | ||
description: "Internals" | ||
permalink: /internals/ | ||
toc: true | ||
toc_label: Contents | ||
toc_icon: cog | ||
--- | ||
|
||
This document describes some details of Elastiknn's internals that might be useful for users. | ||
|
||
### Storing Model Parameters | ||
|
||
The LSH models all use randomized parameters to hash vectors. | ||
The simplest example is the bit-sampling model for Hamming similarity, parameterized by a list of randomly sampled indices. | ||
A more complicated example is the stable distributions model for L2 similarity, parameterized by a set of random unit vectors | ||
and a set of random bias scalars. | ||
These parameters aren't actually stored anywhere in Elasticsearch. | ||
Rather, they are lazily re-computed from a fixed random seed (0) when they are needed. | ||
The advantage of this is that it avoids storing and synchronizing potentially large parameter blobs in the cluster. | ||
The disadvantage is that re-computing the randomized parameters is expensive. | ||
To avoid re-computing them on every request, each Elasticsearch node keeps a cache of models, keyed on the model hyperparameters (e.g. `L` and `k`). | ||
The hyperparameters are stored inside the mappings where they are originally defined. | ||
|
||
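The seed-plus-cache scheme described above can be sketched in a few lines of Python. This is only an illustration, not Elastiknn's actual code: the function name `bit_sampling_model`, the `dims` argument, and the layout of `L` tables with `k` sampled indices each are all assumptions made for the example.

```python
import random
from functools import lru_cache

SEED = 0  # fixed random seed, as described above


@lru_cache(maxsize=None)  # per-process cache keyed on the hyperparameters
def bit_sampling_model(dims: int, L: int, k: int) -> tuple:
    """Hypothetical bit-sampling model: L tables of k sampled indices each.

    Nothing is persisted. The same hyperparameters always yield the same
    parameters because the generator is re-seeded on every (uncached) call.
    """
    rng = random.Random(SEED)
    return tuple(
        tuple(rng.randrange(dims) for _ in range(k))
        for _ in range(L)
    )


# The first call computes the parameters; the second is served from the cache.
model_a = bit_sampling_model(128, 10, 3)
model_b = bit_sampling_model(128, 10, 3)
```

Because the parameters are a pure function of the seed and the hyperparameters, a node that has never seen the model (or has evicted it from its cache) reconstructs identical parameters without any coordination with other nodes.
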
### Transforming and Indexing Vectors | ||
|
||
Each vector is transformed (e.g. hashed) based on its mapping when the user makes an indexing request. | ||
Every vector is stored in a binary [doc values field](https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html) | ||
containing a serialized version of the vector, which is used for exact queries. | ||
Vectors indexed using an LSH model additionally have their hashes indexed using a Lucene Term field. | ||
For example, for a sparse bool vector with a Jaccard LSH mapping, Elastiknn indexes the exact vector as a byte array in | ||
a doc values field and the vector's hash values as a set of Lucene Terms. |
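
The Jaccard example above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not Elastiknn's implementation: the MinHash functions of the form `(a * i + b) % PRIME`, the byte layout produced by `serialize_exact`, the term format, and all function and field names are hypothetical.

```python
import random
import struct

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the hypothetical hash functions


def make_minhash_params(L: int, k: int, seed: int = 0) -> list:
    """L tables of k (a, b) pairs, derived from a fixed seed."""
    rng = random.Random(seed)
    return [
        [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
        for _ in range(L)
    ]


def hash_terms(true_indices: list, params: list) -> list:
    """One term per table: the table number plus its k MinHash values.

    Assumes true_indices is non-empty; min() raises ValueError otherwise.
    """
    terms = []
    for table, funcs in enumerate(params):
        mins = [min((a * i + b) % PRIME for i in true_indices) for a, b in funcs]
        terms.append(f"{table}:" + ",".join(map(str, mins)))
    return terms


def serialize_exact(true_indices: list, total_indices: int) -> bytes:
    """Assumed layout: total_indices, count, then the sorted true indices."""
    idx = sorted(true_indices)
    return struct.pack(f"<ii{len(idx)}i", total_indices, len(idx), *idx)


def to_document(true_indices: list, total_indices: int, params: list) -> dict:
    """The two things that get stored for one sparse bool vector."""
    return {
        "exact_doc_value": serialize_exact(true_indices, total_indices),
        "hash_terms": hash_terms(true_indices, params),
    }


params = make_minhash_params(L=4, k=2)
doc = to_document([1, 5, 9], total_indices=100, params=params)
```

Here `doc["exact_doc_value"]` plays the role of the binary doc values field used for exact queries, and `doc["hash_terms"]` plays the role of the indexed Lucene Terms that an approximate query would match against.
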