embeddingsearch

embeddingsearch is a search server that uses Embedding Similarity Search (similarly to Magna) to semantically compare a given input to a database of indexed entries.

embeddingsearch offers:

- Privacy and flexibility through self-hosted solutions like:
  - ollama
  - OpenAI-compatible APIs (like LocalAI)
- Great flexibility through deep control over
  - the number of datapoints per entity (i.e. the thing you're trying to find)
  - which models are used (multiple per datapoint possible to improve accuracy)
  - which models are sourced from where (multiple Ollama/OpenAI-compatible sources possible)
  - similarity calculation methods
  - aggregation of results (when multiple models are used per datapoint)
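
To illustrate the last two points, here is a minimal Python sketch of one possible similarity method (cosine similarity) and one possible aggregation (the mean across models). It is an illustration only and not the server's actual implementation; the function names, the model keys and the toy vectors are made up for this example.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def score_datapoint(query_by_model, datapoint_by_model, aggregate=mean):
    """Compare one datapoint to the query once per model, then aggregate.

    Both arguments map a model name to the embedding that model produced.
    """
    scores = [
        cosine_similarity(query_by_model[model], embedding)
        for model, embedding in datapoint_by_model.items()
    ]
    return aggregate(scores)


# Toy two-dimensional embeddings from two hypothetical models.
query = {"model-a": [1.0, 0.0], "model-b": [0.0, 1.0]}
title = {"model-a": [1.0, 0.0], "model-b": [1.0, 0.0]}

print(score_datapoint(query, title))                 # 0.5 (mean of 1.0 and 0.0)
print(score_datapoint(query, title, aggregate=max))  # 1.0 (best model wins)
```

Swapping `aggregate` (mean, max, a weighted sum, ...) is the kind of choice the "aggregation of results" setting refers to; which options the server actually exposes is described in its configuration documentation.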

This repository comes with a

- server (accessible via API calls & Swagger)
- client-side library (C#)
- scripting-based indexer service that supports the use of
  - Python
  - C# (Roslyn)
  - Golang (Planned)
  - JavaScript (Planned)

How to set up / use

Server

(Docker now available! See Docker installation)

1. Install ollama
2. Pull a few models using ollama (e.g. paraphrase-multilingual, bge-m3, mxbai-embed-large, nomic-embed-text)
3. Install the dependencies
4. Set up a local MySQL database
5. Set up the configuration
6. In `src/server`, execute `dotnet build && dotnet run` to start the server
7. (optional) Create a searchdomain using the web interface
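
Before starting the server, it can help to confirm that the pulled models actually return embeddings. The following Python sketch is not part of this repository. It assumes that ollama listens on its default address `http://localhost:11434` and that your ollama version provides the `/api/embed` endpoint; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: ollama's default local address


def build_embed_request(model, text, base_url=OLLAMA_URL):
    """Build a POST request for ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model, text, base_url=OLLAMA_URL):
    """Return the embedding vector ollama produces for `text`."""
    request = build_embed_request(model, text, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["embeddings"][0]


def check_models(models, base_url=OLLAMA_URL):
    """Print the embedding dimension of every model in `models`."""
    for model in models:
        vector = embed(model, "hello world", base_url)
        print(f"{model}: {len(vector)} dimensions")
```

Calling `check_models(["bge-m3", "nomic-embed-text"])` with ollama running should print one line per model; an HTTP error usually means the model has not been pulled yet.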

Client

1. Download the package and add it to your project (TODO: NuGet)
2. Create a new client by either:
   1. Injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
   2. Specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)

Indexer

(Docker now available! See Docker installation)

1. Install the dependencies
2. Set up the server
3. Configure the indexer
4. Set up your indexing script(s)
5. Run with `dotnet build && dotnet run` (or `/usr/bin/dotnet build && /usr/bin/dotnet run`)

Known issues

| Issue | Solution |
| --- | --- |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
| System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt. Also: try running the indexer using `/usr/bin/dotnet` instead of `dotnet` (make sure dotnet is installed via apt) |
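
For the `libpython3.12.so` error, a quick way to see whether the dynamic loader can find the library at all is to try opening it yourself. The sketch below is only an approximation of what the indexer does: `ctypes.CDLL` opens the library with its own default flags rather than `RTLD_NOW | RTLD_GLOBAL`, and the helper names are hypothetical.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


def report(library_name="libpython3.12.so"):
    """Describe whether `library_name` is visible to the dynamic loader."""
    status = "found" if can_load(library_name) else "NOT found"
    return f"{library_name}: {status}"
```

If `report()` says the library is not found, installing python3.12-dev as described in the table is the first thing to try.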

To-do

- (High priority) Add default indexer
  - Library
    - Processing:
      - Text / Markdown documents: file name, full text, paragraphs
      - Documents
        - PDF: file name, full text, headline?, paragraphs, images?
        - odt/docx: file name, full text, headline?, images?
        - msg/eml: file name, title, recipients, cc, text
      - Images: file name, OCR, image description?
      - Videos?
      - Presentations (Impress/PowerPoint): file name, full text, first slide title, titles, slide texts
      - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
      - Other? (TBD)
  - Server
    - ~~Scripting capability (Python; perhaps also lua)~~ (Done with the latest commits)
      - ~~Intended sourcing possibilities:~~
        - ~~Local/Remote files (CIFS, SMB, FTP)~~
        - ~~Database contents (MySQL, MSSQL)~~
        - ~~Web requests (E.g. manual crawling)~~
    - ~~Script call management (interval based & event based)~~
- Implement ReaderWriterLock for entityCache to allow for multithreaded read access while retaining single-threaded write access.
- NuGet packaging and corresponding README documentation
- Add option for query result detail levels, e.g.:
  - Level 0: `{"Name": "...", "Value": 0.53}`
  - Level 1: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65}, {...}]}`
  - Level 2: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65, "Embeddings": [{"Model": "bge-m3", "Value": 0.87}, {...}]}, {...}]}`
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
- Remove the id columns from the database tables that are actually identified by (and should be unique by) their name, which should become the new primary key.
- Improve performance & latency (Create ready-to-go processes that each contain an n-th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
- Implement dynamic invocation based database migrations
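
The query result detail levels proposed in the list above could all be derived from one full result by trimming nested fields. The Python sketch below is hypothetical, since the feature is still a to-do: the function name and the example entity are made up, and only the field names (`Name`, `Value`, `Datapoints`, `Embeddings`) are taken from the examples in the list.

```python
def trim_result(result, level):
    """Return a copy of `result` reduced to detail level 0, 1 or 2."""
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1 or 2")
    trimmed = {"Name": result["Name"], "Value": result["Value"]}
    if level >= 1:
        trimmed["Datapoints"] = []
        for datapoint in result.get("Datapoints", []):
            entry = {"Name": datapoint["Name"], "Value": datapoint["Value"]}
            if level >= 2:
                # Copy each per-model score so the caller cannot mutate the source.
                entry["Embeddings"] = [dict(e) for e in datapoint.get("Embeddings", [])]
            trimmed["Datapoints"].append(entry)
    return trimmed


# A made-up full (level 2) result for one entity.
full = {
    "Name": "example-entity",
    "Value": 0.53,
    "Datapoints": [
        {
            "Name": "title",
            "Value": 0.65,
            "Embeddings": [{"Model": "bge-m3", "Value": 0.87}],
        },
    ],
}

print(trim_result(full, 0))  # {'Name': 'example-entity', 'Value': 0.53}
```

Trimming on the server side keeps small responses small; whether the real implementation trims a full result or skips computing the details in the first place is an open design choice.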

Future features

Support for other database types (MSSQL, SQLite)
