embeddingsearch is a search server that uses Embedding Similarity Search (similarly to Magna) to semantically compare a given input to a database of indexed entries.
embeddingsearch offers:
- Privacy and flexibility through self-hosted solutions like:
  - ollama
  - OpenAI-compatible APIs (like LocalAI)
- Great flexibility through deep control over:
  - the number of datapoints per entity (i.e. the thing you're trying to find)
  - which models are used (multiple per datapoint possible to improve accuracy)
  - which models are sourced from where (multiple Ollama/OpenAI-compatible sources possible)
  - similarity calculation methods
  - aggregation of results (when multiple models are used per datapoint)
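To make the last two points concrete, here is a minimal Python sketch of how a similarity method and an aggregation step could fit together when one datapoint is embedded with several models. It is an illustration only, not the server's implementation: the function names, the dictionary layout, and the choice of cosine similarity, mean (across models), and max (across datapoints) are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector counts as "no similarity"
    return dot / (norm_a * norm_b)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def score_datapoint(query_by_model, datapoint_by_model,
                    similarity=cosine_similarity, aggregate=mean):
    """Compare per-model embeddings and aggregate them into one score.

    Both arguments map a (hypothetical) model name to an embedding vector.
    Only models present on both sides are compared.
    """
    shared = [m for m in datapoint_by_model if m in query_by_model]
    if not shared:
        return 0.0
    return aggregate(
        similarity(query_by_model[m], datapoint_by_model[m]) for m in shared
    )


def score_entity(query_by_model, entity, aggregate=max):
    """Aggregate the datapoint scores of one entity into a single value."""
    return aggregate(
        score_datapoint(query_by_model, dp)
        for dp in entity["datapoints"].values()
    )


# Hypothetical toy data: two models, two datapoints, 2-dimensional vectors.
query = {"model-a": [1.0, 0.0], "model-b": [0.0, 1.0]}
entity = {
    "name": "example",
    "datapoints": {
        "title": {"model-a": [1.0, 0.0], "model-b": [0.0, 1.0]},
        "text": {"model-a": [0.0, 1.0], "model-b": [0.0, 1.0]},
    },
}
print(score_entity(query, entity))
```

Passing a different `similarity` or `aggregate` callable is how this sketch models the "deep control" described above; the real server may expose these choices differently.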
This repository comes with a
- server (accessible via API calls & swagger)
- client-side library (C#)
- scripting-based indexer service that supports the use of
  - Python
  - C# (Roslyn)
  - Golang (planned)
  - JavaScript (planned)
(Docker now available! See Docker installation)
- Install ollama
- Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
- Install the dependencies
- Set up a local mysql database
- Set up the configuration
- In `src/server`, execute `dotnet build && dotnet run` to start the server
- (optional) Create a searchdomain using the web interface
- Download the package and add it to your project (TODO: NuGet)
- Create a new client by either:
  - injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
  - specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
(Docker now available! See Docker installation)
- Install the dependencies
- Set up the server
- Configure the indexer
- Set up your indexing script(s)
- Run with `dotnet build && dotnet run` (or `/usr/bin/dotnet build && /usr/bin/dotnet run`)
Issue | Solution |
---|---|
Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt. Also try running the indexer using /usr/bin/dotnet instead of dotnet (make sure dotnet is installed via apt) |
- (High priority) Add default indexer
  - Library
    - Processing:
      - Text / Markdown documents: file name, full text, paragraphs
      - Documents
        - PDF: file name, full text, headline?, paragraphs, images?
        - odt/docx: file name, full text, headline?, images?
        - msg/eml: file name, title, recipients, cc, text
      - Images: file name, OCR, image description?
      - Videos?
      - Presentations (Impress/Powerpoint): file name, full text, first slide title, titles, slide texts
      - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
      - Other? (TBD)
  - Server
    - Scripting capability (Python; perhaps also lua) (Done with the latest commits)
    - Intended sourcing possibilities:
      - Local/Remote files (CIFS, SMB, FTP)
      - Database contents (MySQL, MSSQL)
      - Web requests (e.g. manual crawling)
    - Script call management (interval based & event based)
- Implement ReaderWriterLock for entityCache to allow for multithreaded read access while retaining single-threaded write access.
- NuGet packaging and corresponding README documentation
- Add option for query result detail levels. e.g.:
  - Level 0: `{"Name": "...", "Value": 0.53}`
  - Level 1: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65}, {...}]}`
  - Level 2: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65, "Embeddings": [{"Model": "bge-m3", "Value": 0.87}, {...}]}, {...}]}`
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
- Remove the `id` columns from the database tables where the table is actually identified by (and should be unique by) the name, which should become the new primary key.
- Improve performance & latency (Create ready-to-go processes, each containing an n-th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
- Implement dynamic invocation based database migrations
- Support for other database types (MSSQL, SQLite)
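
As an illustration of the proposed query result detail levels from the list above, the following Python sketch trims one fully detailed result down to the requested level. The key names (`Name`, `Value`, `Datapoints`, `Embeddings`, `Model`) are taken from the JSON examples in that list item; the function `format_result` and its `level` parameter are hypothetical and do not exist in the server or the client library.

```python
import json


def format_result(entity_result, level=0):
    """Return a copy of entity_result reduced to the given detail level.

    level 0: entity name and value only
    level 1: additionally the per-datapoint values
    level 2: additionally the per-model embedding values of each datapoint
    """
    out = {"Name": entity_result["Name"], "Value": entity_result["Value"]}
    if level >= 1:
        out["Datapoints"] = []
        for dp in entity_result.get("Datapoints", []):
            dp_out = {"Name": dp["Name"], "Value": dp["Value"]}
            if level >= 2:
                dp_out["Embeddings"] = [
                    {"Model": e["Model"], "Value": e["Value"]}
                    for e in dp.get("Embeddings", [])
                ]
            out["Datapoints"].append(dp_out)
    return out


# Hypothetical fully detailed result for a single entity.
full_result = {
    "Name": "example",
    "Value": 0.53,
    "Datapoints": [
        {
            "Name": "title",
            "Value": 0.65,
            "Embeddings": [{"Model": "bge-m3", "Value": 0.87}],
        }
    ],
}

for level in (0, 1, 2):
    print(json.dumps(format_result(full_result, level)))
```

Computing the full result once and trimming it on output keeps the scoring code independent of the detail level; whether the server should instead skip the extra bookkeeping at lower levels is an open design question.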