
Pure Python vector database • int8 quantized • ~1100 QPS @ 50k vectors • single file • no compile • MIT

SherifSystems/PythonVectorDB

PythonVectorDB


Pure Python vector database with Int8 quantization and lazy deletion.


🚀 Features

  • 🧠 Int8 Quantization: 4x memory savings with minimal accuracy loss
  • ⚡ Fast Search: Numba-optimized cosine similarity with parallel processing
  • 🗑️ Lazy Deletion: Efficient deletion with threshold-based compaction
  • 🔒 Thread-Safe: All operations protected by locks
  • 💾 Binary Save/Load: Fast persistence using NumPy's compressed format
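
The int8 idea behind the first feature can be sketched in a few lines: store each vector as int8 codes plus a per-vector float scale. This is a standalone illustration, not the quantizer in pythonvectordb.py, whose exact scheme may differ:

```python
import numpy as np

def quantize_int8(v):
    """Map a float32 vector to int8 codes plus a per-vector scale factor."""
    scale = max(float(np.abs(v).max()) / 127.0, 1e-12)  # guard against all-zero input
    q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximate float32 vector from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)
q, scale = quantize_int8(v)
v_hat = dequantize(q, scale)

print(v.nbytes // q.nbytes)  # 4 -- the 4x memory saving on raw vector data

# Rounding error is at most scale/2 per component, so cosine similarity
# between the original and the reconstruction stays very close to 1.
cos = float(v @ v_hat) / float(np.linalg.norm(v) * np.linalg.norm(v_hat))
```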

📦 Installation

```shell
pip install numpy numba
```

Then copy pythonvectordb.py to your project.


🎯 Quick Start

```python
import numpy as np
from pythonvectordb import PythonVectorDB

# Create database
db = PythonVectorDB(dimension=128)

# Add vectors
vectors = np.random.randn(1000, 128).astype(np.float32)
db.add_vectors(vectors)

# Search
query = np.random.randn(128).astype(np.float32)
results = db.search(query, k=10)

for vector_id, score, metadata in results:
    print(f"{vector_id}: {score:.4f}")
```

📚 API Reference

Initialize

```python
db = PythonVectorDB(dimension=128, initial_capacity=10000)
```

Add Vectors

```python
db.add_vectors(
    vectors,              # np.ndarray of shape (n, dimension)
    vector_ids=None,      # Optional list of IDs
    metadata=None         # Optional list of dicts
)
```

Search

```python
results = db.search(
    query,                # np.ndarray of shape (dimension,)
    k=10,                 # Number of results
    filter_fn=None        # Optional filter function
)
# Returns: List[(vector_id, score, metadata)]
```

**Performance Note:** Heavy metadata filtering on >300k vectors adds Python-side overhead.
For high-volume filtering, pre-partition data or use external ID filtering.
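
One way to keep filtering out of the per-candidate Python path is to turn an allowed-ID set into a NumPy mask before scoring. This is a standalone sketch of the idea, not the library's internals, and `cosine_topk` is a hypothetical helper:

```python
import numpy as np

def cosine_topk(db_vecs, query, k=5, allowed_ids=None):
    """Brute-force cosine top-k with optional ID pre-filtering.
    A boolean mask keeps the filter vectorized instead of invoking a
    Python predicate once per candidate."""
    denom = np.linalg.norm(db_vecs, axis=1) * np.linalg.norm(query)
    scores = (db_vecs @ query) / np.maximum(denom, 1e-12)
    if allowed_ids is not None:
        mask = np.zeros(len(db_vecs), dtype=bool)
        mask[list(allowed_ids)] = True
        scores = np.where(mask, scores, -np.inf)  # excluded rows can never win
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]

rng = np.random.default_rng(1)
vecs = rng.standard_normal((1000, 128)).astype(np.float32)
# Only 4 IDs are allowed, so at most 4 results come back even with k=5.
hits = cosine_topk(vecs, vecs[3], k=5, allowed_ids={3, 10, 42, 99})
```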

Save/Load

```python
db.save("database.npz")
db = PythonVectorDB.load("database.npz")
```
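
The `.npz` extension maps naturally onto `np.savez_compressed`. The sketch below shows that round trip for an int8 matrix plus per-vector scales; the array names (`vectors`, `scales`) are illustrative, not necessarily the schema `save` actually writes:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(-127, 128, size=(100, 128)).astype(np.int8)
scales = rng.random(100).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "database.npz")
    # One compressed file holds every named array.
    np.savez_compressed(path, vectors=codes, scales=scales)
    with np.load(path) as loaded:
        roundtrip_ok = (np.array_equal(loaded["vectors"], codes)
                        and np.array_equal(loaded["scales"], scales))

print(roundtrip_ok)  # True
```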

Delete Vector

```python
db.delete_vector(vector_id)  # Lazy deletion
```
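
The pattern behind lazy deletion can be sketched independently: mark rows dead in O(1) and only pay for compaction once the dead fraction crosses a threshold. `LazyStore` and the 30% threshold here are illustrative assumptions, not the library's actual values:

```python
import numpy as np

class LazyStore:
    """Tombstone rows on delete; compact once dead_fraction >= threshold."""

    def __init__(self, vectors, threshold=0.3):
        self.vectors = vectors
        self.alive = np.ones(len(vectors), dtype=bool)
        self.threshold = threshold

    def delete(self, row):
        self.alive[row] = False                       # O(1): just flip a flag
        if 1.0 - self.alive.mean() >= self.threshold:
            self.vectors = self.vectors[self.alive]   # amortized O(n) copy
            self.alive = np.ones(len(self.vectors), dtype=bool)

store = LazyStore(np.zeros((10, 128), dtype=np.int8))
store.delete(0); store.delete(1)   # 20% dead: no compaction yet
store.delete(2)                    # 30% dead: triggers compaction
print(len(store.vectors))  # 7
```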

Get Stats

```python
stats = db.get_stats()
print(stats)  # Memory usage, QPS, latencies
```

⚡ Performance

Benchmarked with 128-dimensional vectors at several database sizes:

| Database Size | Search QPS | Memory/Vector |
| --- | ---: | ---: |
| 1,000 vectors | 16,619 | 640 bytes |
| 10,000 vectors | 3,676 | 466 bytes |
| 50,000 vectors | 1,159 | 608 bytes |
| 100,000 vectors | 448 | 466 bytes |

Peak Performance:

  • Insert: 1.27M vectors/sec (batch size 1,000)
  • Memory Efficiency: 466 bytes/vector (4x savings vs float32)
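
The 4x figure is just the raw-storage arithmetic; the measured ~466 bytes/vector additionally carries ID and bookkeeping overhead:

```python
dim = 128
float32_bytes = dim * 4  # 512 bytes/vector uncompressed
int8_bytes = dim * 1     # 128 bytes/vector quantized (plus a small scale factor)
print(float32_bytes // int8_bytes)  # 4
```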

🧪 Testing

Install the dependencies, then run the benchmark suite:

```shell
pip install -r requirements.txt  # includes psutil for the benchmarks
python benchmark_suite.py        # performance benchmarks
```

The suite runs out of the box once the dependencies are installed – no further setup required.


📄 License

MIT License – see pythonvectordb.py for details.


🤝 Contributing

Issues and PRs welcome! This is a single-file project – keep it simple.


PythonVectorDB – the vector database that actually works in pure Python.

