Data Structure In the Era of Big Data

Vector Search Engine

A vector search engine (aka neural search engine or deep search engine) uses deep learning models to encode data sets into meaningful vector representations, where the distance between vectors represents the similarity between the corresponding items.
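
A minimal brute-force sketch of the idea (the embeddings are assumed to come from some upstream model; all names here are illustrative, and real engines use approximate indexes such as LSH or HNSW instead of a full scan):

```python
import numpy as np

def nearest(query_vec, index_vecs, k=3):
    """Return indices of the k vectors closest to the query by cosine similarity."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = X @ q                   # similarity of the query to every item
    return np.argsort(-sims)[:k]   # most similar first

# Toy usage: five random "embeddings" of dimension 8.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))
print(nearest(vectors[2], vectors, k=2))  # item 2 ranks itself first
```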

Probabilistic data structures

Probabilistic data structures is a common name for data structures based mostly on different hashing techniques. Unlike regular (deterministic) data structures, they provide approximate answers, but with reliable ways to estimate the possible error. The potential loss of accuracy is fully compensated for by extremely low memory requirements, constant query time, and easy scaling, three factors that become important in Big Data applications.

Hashing

To quote the Wikipedia article on hash functions:

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

Hashed indexes use a hashing function to compute the hash of the value of the index field. The hashing function collapses embedded documents and computes the hash for the entire value, but it does not support multi-key (i.e., array) indexes.

Real-life data tends to get corrupted because machines (and humans) are never as reliable as we wish. One efficient way to make sure your data wasn't unintentionally modified is to generate some kind of hash. That hash should be unique, compact, and efficient:

  • unique: any modification to the data should generate a different hash
  • compact: as few bits or bytes as possible, to keep the overhead low
  • efficient: use little computing resources, i.e. fast and low memory usage

A hashing model takes an input data point, e.g. an image or document, and outputs a sequence of bits (a hash code) representing that data point.

Hashing models can be broadly divided into two categories: quantization and projection. Projection models focus on learning a low-dimensional transformation of the input data in a way that encourages related data points to be closer together in the new space. In contrast, quantization models seek to convert those projections into binary codes by using a thresholding mechanism. The projection branch can be further divided into data-independent, data-dependent (unsupervised), and data-dependent (supervised) methods, depending on whether the projections are influenced by the distribution of the data or by available class labels.
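
A minimal sketch of the data-independent flavor: a random Gaussian projection (projection stage) followed by sign thresholding (quantization stage). All names and sizes are illustrative:

```python
import numpy as np

def random_projection_hash(x, W):
    """Project the data point (projection stage), then threshold at zero
    (quantization stage) to obtain a binary hash code."""
    return (W @ x > 0).astype(np.uint8)

rng = np.random.default_rng(1)
dim, bits = 64, 16
W = rng.normal(size=(bits, dim))  # data-independent: random, not learned

x = rng.normal(size=dim)
print(random_projection_hash(x, W))  # e.g. [1 0 1 ...], a 16-bit code
```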

Fowler-Noll-Vo Hash

The basis of the FNV hash algorithm was taken from an idea sent as reviewer comments to the IEEE POSIX P1003.2 committee by Glenn Fowler and Phong Vo back in 1991. In a subsequent ballot round, Landon Curt Noll improved on their algorithm. Some people tried this hash and found that it worked rather well, and in an email message to Landon they named it the "Fowler/Noll/Vo" or FNV hash.

FNV hashes are designed to be fast while maintaining a low collision rate. The FNV speed allows one to quickly hash lots of data while maintaining a reasonable collision rate. The high dispersion of the FNV hashes makes them well suited for hashing nearly identical strings such as URLs, hostnames, filenames, text, IP addresses, etc.
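
A sketch of the 32-bit FNV-1a variant, which xors each input byte into the state before multiplying by the FNV prime:

```python
FNV32_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV32_PRIME = 16777619           # standard 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor in each byte, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result at 32 bits
    return h

print(hex(fnv1a_32(b"hello")))  # deterministic 32-bit digest
```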

MurmurHash

MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. The name comes from two basic operations, multiply (MU) and rotate (R), used in its inner loop. Unlike cryptographic hash functions, it is not specifically designed to be difficult to reverse by an adversary, making it unsuitable for cryptographic purposes.
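
For example, using the third-party mmh3 Python binding (an assumption here; install with pip install mmh3):

```python
import mmh3  # third-party binding to MurmurHash3

# A fixed seed keeps results reproducible across runs and machines.
h32 = mmh3.hash("some key", seed=42)      # signed 32-bit integer
h128 = mmh3.hash128("some key", seed=42)  # 128-bit integer

print(h32, h128)
```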

Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is a class of methods for the nearest neighbor search problem, which is defined as follows: given a dataset of points in a metric space (e.g., R^d with the Euclidean distance), our goal is to preprocess the dataset so that we can quickly answer nearest neighbor queries: given a previously unseen query point, we want to find one or several points in our dataset that are closest to the query point.
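
A minimal sketch of one classical LSH family, random hyperplanes for cosine similarity (class and parameter names are illustrative):

```python
import numpy as np

class HyperplaneLSH:
    """Random-hyperplane LSH: points falling on the same side of every
    hyperplane land in the same bucket, so near-duplicates tend to collide."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_planes, dim))
        self.buckets = {}

    def _key(self, v):
        # One bit per hyperplane: which side of it does v lie on?
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, v, label):
        self.buckets.setdefault(self._key(v), []).append(label)

    def query(self, v):
        # Candidates share the query's bucket; verify distances exactly afterwards.
        return self.buckets.get(self._key(v), [])

rng = np.random.default_rng(1)
index = HyperplaneLSH(dim=32)
data = rng.normal(size=(100, 32))
for i, v in enumerate(data):
    index.add(v, i)
print(index.query(data[7]))  # contains 7, plus other same-bucket candidates
```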

MinHash

MinHash was originally an algorithm for quickly estimating the Jaccard similarity between two sets, but it can also be packaged as a data structure built around that algorithm: a probabilistic structure that quickly estimates how similar two sets are.
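
A small sketch of the estimator, using Python's built-in hash() mixed with random salts as a stand-in for a family of independent hash functions:

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """Simulate num_hashes independent hash functions by xor-ing fixed random
    salts into hash(), keeping the minimum under each one. (Signatures are
    only comparable within one process, since hash() of strings is salted
    per interpreter run.)"""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(x) ^ s for x in items) for s in salts]

def estimate_jaccard(sig_a, sig_b):
    # The probability that two sets share the same minimum under a random
    # hash function equals their Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox sleeps under the lazy cat".split())
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))  # estimate
print(len(a & b) / len(a | b))  # exact value, for comparison
```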

Rank Preserving Hashing

Rank Preserving Hashing (RPH) explicitly optimizes the precision of Hamming-distance ranking so as to preserve the supervised rank information.

Cuckoo Hashing

Cuckoo Hashing is a technique for resolving collisions in hash tables that produces a dictionary with constant-time worst-case lookup and deletion operations as well as amortized constant-time insertion operations.
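
A minimal sketch with two tables and bounded eviction chains. For brevity the two slot functions are derived from a single hash, which a production table would avoid:

```python
class CuckooHashTable:
    """Two tables, two slot functions; each key lives in exactly one of its
    two candidate slots, so lookup and deletion probe at most two cells."""

    def __init__(self, capacity=11):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]

    def _slots(self, key):
        h = hash(key)
        return [h % self.capacity, (h // self.capacity) % self.capacity]

    def get(self, key):
        for table, i in zip(self.tables, self._slots(key)):
            if table[i] is not None and table[i][0] == key:
                return table[i][1]
        return None

    def put(self, key, value, max_kicks=50):
        if self.get(key) is not None:  # key already present: update in place
            for table, i in zip(self.tables, self._slots(key)):
                if table[i] is not None and table[i][0] == key:
                    table[i] = (key, value)
            return
        entry, side = (key, value), 0
        for _ in range(max_kicks):
            i = self._slots(entry[0])[side]
            # Place the entry; whoever occupied the slot gets evicted.
            entry, self.tables[side][i] = self.tables[side][i], entry
            if entry is None:
                return
            side = 1 - side  # the evicted entry moves to its other slot
        raise RuntimeError("eviction cycle: rehash with new functions/capacity")

t = CuckooHashTable()
for k in ["a", "b", "c"]:
    t.put(k, k.upper())
print(t.get("b"))  # B
```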

Consistent Hashing

Consistent hashing implements scalability in a storage system by dividing the data among multiple storage servers in such a way that adding or removing a server only requires remapping a small fraction of the keys.
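
A minimal hash-ring sketch with virtual nodes (server names and parameters are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Servers are hashed onto a ring; a key is stored on the first server
    clockwise from the key's hash. Adding or removing a server only remaps
    the keys in the arc it gains or loses; virtual nodes smooth the load."""

    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted (position, server) pairs
        for s in servers:
            for v in range(vnodes):
                self.ring.append((self._pos(f"{s}#{v}"), s))
        self.ring.sort()
        self.positions = [p for p, _ in self.ring]

    @staticmethod
    def _pos(label):
        return int(hashlib.md5(label.encode()).hexdigest(), 16)

    def server_for(self, key):
        # First ring position past the key's hash, wrapping around the ring.
        i = bisect.bisect(self.positions, self._pos(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.server_for(key))
```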

Random Projection and Hashing

LSH-Sampling

Bio-Inspired Hashing

The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.

FlyHash
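
Based on the description above, a minimal FlyHash-style sketch: a sparse random expansion to a much higher dimension, followed by winner-take-all selection of the top k activations. The matrix sparsity and sizes here are illustrative, not the paper's exact settings:

```python
import numpy as np

def flyhash(x, M, k):
    """Expand to a high dimension via a sparse random projection, then keep
    only the k largest activations (winner-take-all) as the set bits."""
    activations = M @ x
    code = np.zeros(M.shape[0], dtype=np.uint8)
    code[np.argsort(-activations)[:k]] = 1
    return code

rng = np.random.default_rng(0)
d, m = 50, 2000  # expand 50 input dims to 2000 (sparse, expansive)
M = (rng.random((m, d)) < 0.1).astype(float)  # each row samples ~10% of inputs
x = rng.random(d)
print(flyhash(x, M, k=40).sum())  # exactly 40 active bits out of 2000
```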

Bloom filters

Bloom filters, counting Bloom filters, and multi-hashing tables

Counting Bloom filters

The same property that causes false positives also makes it difficult to remove an element from the filter, as there is no easy way to tell whether another element hashes to the same bit. Unsetting a bit that is shared by multiple elements would cause false negatives. Using a counter instead of a bit circumvents this issue: the counter is incremented when an element hashes to a given location and decremented upon removal, and membership queries check whether every counter for the element is greater than zero. This reduces the exceptional space efficiency of the standard Bloom filter.
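
A minimal counting-filter sketch along those lines. The size and the double-hashing trick for deriving k indexes are illustrative choices:

```python
import hashlib

class CountingBloomFilter:
    """Counters instead of bits, so elements can be removed: increment on
    add, decrement on remove, and report membership when all counters > 0."""

    def __init__(self, size=1024, num_hashes=4):
        self.counts = [0] * size
        self.size, self.k = size, num_hashes

    def _positions(self, item):
        # Double hashing: derive k indexes from two halves of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        # Caller must only remove items previously added, or counts go wrong.
        for p in self._positions(item):
            self.counts[p] -= 1

    def __contains__(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloomFilter()
cbf.add("apple"); cbf.add("pear")
cbf.remove("pear")
print("apple" in cbf, "pear" in cbf)  # True False (false positives possible)
```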

Bloomier Filter

Blocked Bloom filters

Blocked Bloom filters are a cache-efficient variant of Bloom filters, the well-known approximate set data structure. To quote Daniel Lemire, they have "unbeatable speed"; see the benchmarks/ directory of the Blobloom package to determine exactly how fast it is compared to other packages.

Multi-set filters

Quotient Filter

HyperLogLog

Skip Lists

An ordered, key-based data structure that offers dictionary and list performance competitive with balanced trees while remaining relatively easy to implement. Node heights are chosen at random, demonstrating that a probabilistic balancing scheme can still support fast indexed access to items.
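
A minimal skip-list sketch showing how randomized node heights drive search and insertion:

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one forward pointer per level

class SkipList:
    """Each node's height is chosen by coin flips, so the expected search
    path descends through O(log n) levels with no explicit rebalancing."""

    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node  # last node before the insert point at lvl
        new = SkipNode(key, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new

    def __contains__(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        nxt = node.forward[0]
        return nxt is not None and nxt.key == key

sl = SkipList()
for k in [5, 1, 9, 3]:
    sl.insert(k)
print(3 in sl, 4 in sl)  # True False
```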

Count-min Sketch

Approximate Nearest Neighbor Searching and Polytope Approximation

Proximity graphs

A proximity graph is simply a graph in which two vertices are connected by an edge if and only if the vertices satisfy particular geometric requirements.
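
One common geometric requirement is a distance threshold; a tiny sketch of the resulting epsilon-neighborhood graph:

```python
import numpy as np

def epsilon_graph(points, eps):
    """Connect two points iff their Euclidean distance is at most eps."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                edges.append((i, j))
    return edges

rng = np.random.default_rng(0)
pts = rng.random((6, 2))  # six random points in the unit square
print(epsilon_graph(pts, eps=0.5))
```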

Navigable Small World Graph

Hierarchical Navigable Small World (HNSW) Graph

KNN graph

Compact Data Structures

Succinct data structures

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations.

The Succinct Data Structure Library (SDSL) contains many succinct data structures from the following categories:

  • Bit-vectors supporting Rank and Select (see the sketch after this list)
  • Integer Vectors
  • Wavelet Trees
  • Compressed Suffix Arrays (CSA)
  • Balanced Parentheses Representations
  • Longest Common Prefix (LCP) Arrays
  • Compressed Suffix Trees (CST)
  • Range Minimum/Maximum Query (RMQ) Structures
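
To illustrate the first category, a plain (deliberately non-succinct) sketch of the rank and select operations on a bit-vector; real succinct structures answer both in O(1) using only o(n) bits on top of the raw vector:

```python
class BitVectorRank:
    """rank(i): number of 1s in bits[0:i].
    select(j): position of the j-th 1 (1-indexed).
    This sketch precomputes full prefix sums, trading space for clarity."""

    def __init__(self, bits):
        self.bits = bits
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)
        self.ones = [i for i, b in enumerate(bits) if b]

    def rank(self, i):
        return self.prefix[i]

    def select(self, j):
        return self.ones[j - 1]

bv = BitVectorRank([1, 0, 1, 1, 0, 0, 1])
print(bv.rank(4))    # 3 ones among the first 4 bits
print(bv.select(3))  # the 3rd one sits at index 3
```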

Learned Data Structure

Learning to Hash

By using hash codes to construct an index, we can achieve constant or sub-linear search time complexity.

Hash functions are learned from a given training dataset.
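
A minimal data-dependent (unsupervised) example in the spirit of PCA hashing, one possible scheme rather than a specific published method's exact formulation: learn projection directions from the training data, then threshold to get binary codes.

```python
import numpy as np

def learn_hash_functions(X_train, bits=8):
    """Learn projections from data: the top principal directions of the
    centered training set, later thresholded at zero."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:bits]  # top `bits` principal directions

def hash_code(x, mean, W):
    # Center with the training mean, project, and quantize by sign.
    return ((W @ (x - mean)) > 0).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
mean, W = learn_hash_functions(X, bits=8)
print(hash_code(X[0], mean, W))  # learned 8-bit code for the first point
```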

Boosted LSH