Skip to content

Writing a fast Vector Database in Python

Thomas Dybdahl Ahle edited this page Apr 19, 2023 · 2 revisions
A K-Nearest Neighbour Graph clustered using the K-Means algorithm

In today's world, high-dimensional vectors are used to represent various types of data, such as images, text, and audio. Searching for similar vectors is a common task in many applications, including recommendation systems, image retrieval, and natural language processing. One popular approach for fast similarity search is using a Vector Database, which stores and indexes high-dimensional vectors to allow for efficient querying. In this article, we will explore how to implement a fast Vector Database in Python using Product Quantization.

The method we will use consist of two parts:

  • A space partition (we will use K-Means)
  • Product Quantization with Asymmetric Distance Computation.

By the end of this article, we will have a working… Since PQ is the most interesting part, we'll jump into that first.

Understanding Product Quantization

Product Quantization (PQ) is a vector quantization technique used to approximate high-dimensional vectors efficiently. It works by dividing the input vector into multiple sub-vectors and quantizing them separately using codebooks. This results in a compact and computationally efficient representation of the original vector, enabling faster similarity search and reduced storage requirements.

The key steps in Product Quantization are:

  • Splitting the input vector into sub-vectors
  • Training codebooks for each sub-vector
  • Quantizing the sub-vectors using the trained codebooks
  • Combining the quantized sub-vectors to create the final quantized vector

Building the Vector Database

TODO

Clone this wiki locally