Déjà vu

dejavu is a Go module that provides a zero-allocation, space-efficient, logarithmic-time solution to the exact membership query problem, aiming to answer elegantly the following yes-no question: "Is element x a member of the set S?"

Use dejavu when the probabilistic nature of approximate membership query structures, such as Bloom filters, does not satisfy application requirements.

package main

import (
	"fmt"
	"net"

	"github.com/circular-time/dejavu"
)

func main() {
	var (
		cache *dejavu.Cache = dejavu.NewCache128(1 << 22) // 4,194,304 elements
		seen  bool
		value = []byte(net.IPv6loopback)
	)

	fmt.Println(
		cache.Size(),
	)
	// 92274688
	// (88 MiB)

	seen, _ = cache.Recall(value)

	fmt.Println(seen)
	// false

	cache.Insert(value)

	seen, _ = cache.Recall(value)

	fmt.Println(seen)
	// true

	fmt.Println(
		cache.Length(),
	)
	// 1
}

Caveats

dejavu requires cached values to be of fixed length. Users working with data of varying sizes should consider caching not those values, but their hashes instead.

Furthermore, it stores data in a binary tree that is not self-balancing. Values that naturally occur in lexicographical order, such as time-based UUIDs, should be hashed (or have their bytes rearranged so that the random bits come first) before caching for optimal performance.

Deletion of inserted values is not currently supported! To overcome this limitation, consider using two or more Caches in rotation (as in log rotation).
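A rotation scheme might look like the following sketch. Rotator is a hypothetical wrapper, not part of dejavu, written here against a minimal interface (dejavu's actual Recall also returns an error) and demonstrated with a map-backed stand-in so the example is self-contained:

```go
package main

import "fmt"

// cache is the minimal surface a Rotator needs.
type cache interface {
	Insert([]byte)
	Recall([]byte) bool
}

// mapCache is a stand-in implementation for demonstration only.
type mapCache map[string]struct{}

func (m mapCache) Insert(v []byte)      { m[string(v)] = struct{}{} }
func (m mapCache) Recall(v []byte) bool { _, ok := m[string(v)]; return ok }

// Rotator keeps a current and a previous cache: inserts go to the
// current one, recalls consult both, and Rotate discards the
// previous one, so an element is forgotten after two rotations.
type Rotator struct {
	current, previous cache
	fresh             func() cache
}

func NewRotator(fresh func() cache) *Rotator {
	return &Rotator{current: fresh(), previous: fresh(), fresh: fresh}
}

func (r *Rotator) Insert(v []byte) { r.current.Insert(v) }

func (r *Rotator) Recall(v []byte) bool {
	return r.current.Recall(v) || r.previous.Recall(v)
}

func (r *Rotator) Rotate() {
	r.previous = r.current
	r.current = r.fresh()
}

func main() {
	r := NewRotator(func() cache { return mapCache{} })

	r.Insert([]byte("x"))
	r.Rotate()
	fmt.Println(r.Recall([]byte("x"))) // true: still in previous cache
	r.Rotate()
	fmt.Println(r.Recall([]byte("x"))) // false: rotated out
}
```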

Space complexity

dejavu.Cache occupies exactly n (log k + 2 log n) bits of memory to store n elements out of a set of k possibilities. For example, it would take 88 MiB to maintain a cache of more than four million IPv6 addresses.

The additional 2 log n per element above the theoretical baseline of log k is incurred to maintain a binary tree structure, as a trade-off against time complexity.
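The 92,274,688-byte figure reported by cache.Size() above is consistent with this formula if each term is rounded up to whole bytes: 16 bytes of payload (log k = 128 bits) plus two 3-byte tree links (log n = 22 bits, rounded up to 24) per element. A quick arithmetic check, where the byte-rounding is an assumption inferred from the Size() output rather than documented behaviour:

```go
package main

import "fmt"

func main() {
	const (
		n       = 1 << 22 // capacity: 4,194,304 elements
		payload = 128 / 8 // log k = 128 bits -> 16 bytes
		link    = 24 / 8  // log n = 22 bits, rounded up to 24 -> 3 bytes
	)

	// n * (log k + 2 log n) bytes, with both terms byte-aligned.
	fmt.Println(n * (payload + 2*link))
	// 92274688 (88 MiB)
}
```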

The Go runtime has its own overheads that add to total memory consumption.

Time complexity

cache.Insert() and cache.Recall() are binary tree operations with an average time complexity of O(log n).

In the worst case, where values are cached in ascending or descending order, the binary tree degenerates into a linked list with complexity O(n), at which point a regular list, with space complexity Ω(n log k), would be the more space-efficient choice.

Benchmarks

goos: linux
goarch: arm64
pkg: github.com/circular-time/dejavu
BenchmarkCacheInsert
BenchmarkCacheInsert-2     2290011     739.7 ns/op     0 B/op     0 allocs/op
BenchmarkCacheRecall
BenchmarkCacheRecall-2     2246784     759.0 ns/op     0 B/op     0 allocs/op

Experiments suggest that dejavu.Cache is roughly twice as memory-efficient as a Go map[[16]byte]struct{} with the same number of keys, and somewhere in the region of ten times as memory-efficient as an LMDB instance holding the same number of records consisting of 128-bit keys and zero-length values.

package main

import (
	"testing"

	"github.com/circular-time/dejavu"
)

const (
	n = 1 << 22 // 2^22 = 4,194,304
)

func BenchmarkDejavu(b *testing.B) {
	var (
		i int
	)

	for i = 0; i < b.N; i++ {
		dejavu.NewCache128(n)
	}
}

func BenchmarkMap(b *testing.B) {
	var (
		i int
	)

	for i = 0; i < b.N; i++ {
		_ = make(map[[16]byte]struct{}, n)
	}
}
BenchmarkDejavu
BenchmarkDejavu-2    1105    1475508 ns/op     92274688 B/op    1 allocs/op
BenchmarkMap
BenchmarkMap-2        592    2558122 ns/op    160432152 B/op    2 allocs/op