Skip to content

Succinct

Milindu Sanoj Kumarage edited this page May 28, 2016 · 3 revisions

#Introduction

Using data scan to cater search queries takes a long time to process, but with low memory usage as nothing other than the data in loaded into the memory. We can improve the search query performance using indexes but the indexes takes more memory in addition to the original data. What Succinct is addressing this issues, where the data is compressed, but the compression structure works as an index in the same time. This means, the original data is compressed to reduce the memory usage and no more additional memory is needed for maintaining a secondary index. This also eliminates the need of data scans and data needs not to decompressed for searching.

Originally Succinct works on flat unstructured files, but is capable of working on key-value stores, document stores and tables by turning them into flat documents.

We can perform Search operations on Succinct as in other indexed based search engines like ElasticSearch. We can perform count operations very fast and rage queries in the same time complexity of search. We can append data to the Succinct datastore, which will go through the same compression process.

Succinct comes as a standalone distributed system, as a Spark package and as a library( in C++, Java and Scala ). We can use whichever that suits our needs.

##Tradeoffs

  1. Succinct has a preprocessing time, to compress data, which is about 8GB per hour per core.
  2. Succinct needs more computation for random access as the data should be decompressed.
  3. Succinct is designed for random access needs, and sequential scans would not benefit.
  4. Succinct does not support In-Place updates, but via appends we can workaround this.

##Analysis

We can see the query latencies against the input size for the three methods, data scan method, index method and Succinct.

For the data scan method, query latency gradually increases over the input size and once the input size is larger than the memory, query latency increases suddenly and then continue in gradual increase. Because, for data scan method, we load the data file into the memory and once the data file is larger than the memory, we have to do several disk reads to load the parts of the data files to the memory.

When we are maintaining an index, the memory goes out of space faster as now the index also takes space in memory, but we see very less query latency.

When we are using Succinct, it compresses the data files, thus allowing more data to be stored in memory, and additionally, it works as an index itself. This way, we can store more data in memory while benefitting the performance gain of indexing. However, once we go out of memory, Succinct would be a little bit slower than the indexing method.

This means, if we can fit our data into the memory, and we are accessing data randomly, not sequentially and need fast searching, counting, regex searching etc Succinct would be a good choice. However, adding new data and updating the data still takes time as they have to be compressed into Succinct data format.

Here is a benchmarking of Succinct with Spark and ElasticSearch. Graph is in log scale for query latency and in milliseconds.

##Usage

###Unstructured data

import	edu.berkeley.cs.succinct._	

val rdd	= ctx.textFile(...).map(_.getBytes)
val succinctRDD = rdd.succinct

val offsets  = succinctRDD.search("Berkeley") 
val offsets  = succinctRDD.regexSearch("William.*Clinton") 
val count = succinctRDD.count("Berkeley")
val bytes = succinctRDD.extract(50, 100)

###Key-Value data

import	edu.berkeley.cs.succinct.kv._	

val rdd	= ctx.textFile(...).map(_.getBytes)
val succinctKVRDD = rdd.succinctKV

val offsets  = succinctKVRDD.search("Berkeley") 
val offsets  = succinctKVRDD.regexSearch("William.*Clinton") 
val doc = succinctKVRDD.get(0) //Get the 0'th item
val bytes = succinctKVRDD.extract(0, 50, 100)	//Get 50 to 100 bytes of the 0'th item

###JSON data

import	edu.berkeley.cs.succinct.json._	

val rdd	= ctx.textFile(...).map(_.getBytes)
val succinctJsonRDD = rdd.succinctJson

val offsets  = succinctJsonRDD.search("Berkeley") 
val offsets  = succinctJsonRDD.filter("city”,”Berkeley") // Get "city”=”Berkeley" docs
val doc = succinctJsonRDD.get(0)

##What to expect in future

Succinct Encryption : Theoretically, encrypting the data while compressing and decrypting while decompressing is possible, but with slightly more computational. The Succinct team is working on this feature.

Succinct Graphs : On top of, Succinct datastore, the team is working on building a graph framework, where anyone can perform queries on the graph.

##References http://succinct.cs.berkeley.edu/ https://spark-summit.org/east-2016/events/succinct-spark-fast-interactive-queries-on-compressed-rdds/

Clone this wiki locally