This project was created for the Future of Database Programming Contest. We achieved first place 🥇 in the challenge organized by Athena Research Center and the DareLab group.
In this contest the task was to build a real-time cardinality estimator for a table containing two integer columns. Cardinality estimation is at the heart of query optimization in relational databases. Accurate estimations guide the optimizer to choose a good execution plan; an error in estimation can lead to inefficient query plans.
This year's challenge required tackling:
- Frequent data changes.
- Correlation between columns.
- Data skewness.
Besides estimation accuracy, the estimator had to be efficient in terms of memory usage and execution time. It was evaluated based on the following criteria:

- **Data and Operations:** The initial data volume of each test case cannot exceed 50,000,000, including 20,000,000 insert, delete, and query operations. Data values range from 1 to 20,000,000.
- **Memory Usage:** Memory usage is limited to 4 MB.
- **Execution Time:** The estimator must run in under 10 seconds for each test case.
- **Scoring:** The score is based on the estimation error (the lower, the better). Assuming n is the number of queries, the error is calculated as follows (a small code sketch of this computation is shown after the formula):
$$\text{score} = \frac{\sum_{i=1}^{n} \left| \log\left(\frac{\text{EstimatedValue}_i + 1}{\text{RealAnswer}_i + 1}\right) \right|}{n}$$
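In code, the score is just the mean absolute log-ratio between estimated and true counts. A minimal sketch (the function name is illustrative, not part of the contest framework):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute log error over all queries, matching the formula above.
// Both counts are shifted by +1 so that queries with a true count of zero stay well defined.
double computeScore(const std::vector<double>& estimated, const std::vector<double>& real) {
    double total = 0.0;
    for (std::size_t i = 0; i < estimated.size(); ++i)
        total += std::fabs(std::log((estimated[i] + 1.0) / (real[i] + 1.0)));
    return total / estimated.size();
}
```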
More details about the task can be seen on the contest details page.
For the competition, a class named DataExecuterDemo was provided to generate random data while the estimator runs. However, the correct answers were also computed at runtime, which made execution slower. To improve performance, a new class named DataExecuterPrecomputed was created; it reads the data and precomputed answers from files.
To take advantage of this new class, we needed to generate the data and correct-answer files. Data generation is handled by the main file in the project root (main.cpp), and a Makefile in the project root lets you control the size of the generated data. Once the main program is run, it produces two files:
- A data file.
- A file containing the correct answers.
The generation process is quite similar to the one used in the provided DataExecuterDemo class. The main difference is that the correct answers are computed with the OpenMP library to speed up the process, as sketched below. Using this method, a dataset of 5 million rows per column and 2 million queries can be generated in 2 minutes on a 12-thread machine.
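As an illustration of the parallel answer computation, here is a minimal sketch. The `Query` struct and function names are hypothetical and do not match the project's actual generator; the point is only that each query's exact count is an independent scan over the data, so the query loop parallelizes trivially with OpenMP:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical query: one predicate per column, -1 meaning "no constraint on this column".
struct Query {
    int valueA = -1;        // constant compared against column A
    int valueB = -1;        // constant compared against column B
    bool greaterA = false;  // true -> "A > valueA", false -> "A == valueA"
    bool greaterB = false;
};

// Compute the exact answer of every query by scanning the data once per query.
// The data is read-only, so threads need no synchronization.
std::vector<long long> computeAnswers(const std::vector<std::pair<int, int>>& rows,
                                      const std::vector<Query>& queries) {
    std::vector<long long> answers(queries.size(), 0);
    #pragma omp parallel for schedule(dynamic)
    for (long long q = 0; q < static_cast<long long>(queries.size()); ++q) {
        const Query& qu = queries[q];
        long long count = 0;
        for (const auto& [a, b] : rows) {
            bool okA = qu.valueA < 0 || (qu.greaterA ? a > qu.valueA : a == qu.valueA);
            bool okB = qu.valueB < 0 || (qu.greaterB ? b > qu.valueB : b == qu.valueB);
            if (okA && okB) ++count;
        }
        answers[q] = count;
    }
    return answers;
}
```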
To compile the estimator, follow these steps:

- **Create a Build Directory**

  In your terminal, in the CardinalityEstimation folder run `mkdir build && cd build`.

- **Run CMake**

  Generate the build files with `cmake ..`.

- **Build the Project**

  Compile the project with `make`.

- **Run the Estimator**

  The executable is named `main` and it accepts the following arguments: `./main <data_file> <answers_file>`.
The estimator is composed of several components (illustrative sketches of some of them follow this list):

- **Histogram-Based Estimators:** Histograms divide the data values into intervals (buckets) and record the frequency of data in each bucket (HistogramBucketed.hpp). Two key classes make up the histogram implementation:
  - **Buckets:** Having a histogram that stores every possible value individually is not feasible in practice. To address this, a class named `Bucket` groups a range of values and their frequencies. The HistogramBucketed class employs `Bucket` objects to track value frequencies across specified ranges. To enhance estimation accuracy, each `Bucket` is further divided into sub-buckets; this hierarchical structure allows more precise frequency tracking when it is required.
  - **BinaryCache:** The `BinaryCache` class is used for estimation when a query contains a greater-than (>) predicate. Instead of iterating through all the buckets and summing their frequencies, it uses a binary structure in which each lower level represents smaller intervals, precomputed to contain the correct number of elements for efficient aggregation. This reduces the estimation complexity to logarithmic time while keeping memory usage minimal.
- **HyperLogLog (HLL):** The `HLL` class provides a fast estimate of the number of distinct values, based on the HyperLogLog algorithm. The implementation follows the HyperLogLog paper.
- **MCV (Most Common Value) Tracking:** The `MCVTrack` class keeps counts for values that occur frequently. By exploiting the tendency of common values to appear early and frequently, it significantly improves the estimator's accuracy.
- **Third-Party Tools:** Besides the standard library, the project uses:
  - rapidhash for an efficient hash function used in HLL.
  - robin_hood unordered map for improved performance over STL's `unordered_map`.
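The sketches below illustrate simplified versions of the components above; all names, members, and parameters are hypothetical and do not reflect the project's actual interfaces. First, a bucket subdivided into sub-buckets, assuming equal-width ranges and uniformity inside each sub-bucket:

```cpp
#include <cstddef>
#include <vector>

// Simplified bucket covering the value range [lo, hi), with finer sub-bucket counts.
struct Bucket {
    int lo = 0, hi = 1;
    long long total = 0;           // rows whose value falls anywhere in [lo, hi)
    std::vector<long long> sub;    // per-sub-bucket counts for finer resolution

    Bucket(int lo_, int hi_, std::size_t numSub) : lo(lo_), hi(hi_), sub(numSub, 0) {}

    void add(int value, long long delta = 1) {
        total += delta;
        sub[subIndex(value)] += delta;
    }

    // Estimate of rows equal to `value`, spreading the sub-bucket count uniformly
    // over the values that sub-bucket covers.
    double estimateEqual(int value) const {
        double subWidth = static_cast<double>(hi - lo) / sub.size();
        return sub[subIndex(value)] / subWidth;
    }

    std::size_t subIndex(int value) const {
        return static_cast<std::size_t>(value - lo) * sub.size() / (hi - lo);
    }
};
```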
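For the BinaryCache idea, the sketch below uses a Fenwick (binary indexed) tree over bucket counts, which is one standard way to get logarithmic-time prefix aggregation from a single flat array; the project's actual layered structure may differ in layout, but the complexity trade-off it describes is the same:

```cpp
#include <cstddef>
#include <vector>

// Prefix-sum structure over bucket counts: "how many rows fall in buckets 0..k"
// is answered in O(log n), so a ">" predicate needs no full scan of the buckets.
class PrefixCounts {
public:
    explicit PrefixCounts(std::size_t numBuckets) : tree_(numBuckets + 1, 0) {}

    // Add `delta` occurrences to the given 0-based bucket (delta may be negative for deletes).
    void update(std::size_t bucket, long long delta) {
        for (std::size_t i = bucket + 1; i < tree_.size(); i += i & (~i + 1))
            tree_[i] += delta;
    }

    // Total count of buckets 0..bucket (inclusive).
    long long countUpTo(std::size_t bucket) const {
        long long sum = 0;
        for (std::size_t i = bucket + 1; i > 0; i -= i & (~i + 1))
            sum += tree_[i];
        return sum;
    }

    // Count of rows strictly above `bucket`, as needed for a greater-than predicate.
    long long countAbove(std::size_t bucket) const {
        return countUpTo(tree_.size() - 2) - countUpTo(bucket);
    }

private:
    std::vector<long long> tree_;
};
```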
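A stripped-down HyperLogLog follows, omitting the small- and large-range corrections from the paper; it only shows where the registers and the bias constant come from. The caller is expected to pass in a 64-bit hash of the value (the project uses rapidhash for this):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal HyperLogLog with m = 2^p registers. Each register stores the largest
// "rank" (number of leading zero bits + 1) observed among hashes routed to it.
class HyperLogLogSketch {
public:
    explicit HyperLogLogSketch(unsigned p = 12) : p_(p), registers_(1u << p, 0) {}

    void add(std::uint64_t hash) {
        std::uint64_t index = hash >> (64 - p_);    // top p bits choose the register
        std::uint64_t rest  = (hash << p_) | 1ULL;  // guard bit keeps clz well defined
        std::uint8_t rank   = static_cast<std::uint8_t>(__builtin_clzll(rest) + 1);  // GCC/Clang builtin
        if (rank > registers_[index]) registers_[index] = rank;
    }

    // Raw HyperLogLog estimate (no range corrections).
    double estimate() const {
        double m = static_cast<double>(registers_.size());
        double sum = 0.0;
        for (std::uint8_t r : registers_) sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^-r
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // bias constant, valid for m >= 128
        return alpha * m * m / sum;
    }

private:
    unsigned p_;
    std::vector<std::uint8_t> registers_;
};
```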
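Finally, a minimal most-common-value tracker. It keeps exact counts only for values admitted while it still has free slots and ignores later newcomers, relying on the observation that frequent values tend to appear early; the real MCVTrack class may use a different admission policy:

```cpp
#include <cstddef>
#include <unordered_map>

// Bounded per-value counter: exact counts for at most `capacity` distinct values.
class CommonValueCounter {
public:
    explicit CommonValueCounter(std::size_t capacity) : capacity_(capacity) {}

    void onInsert(int value) {
        auto it = counts_.find(value);
        if (it != counts_.end()) { ++it->second; return; }
        if (counts_.size() < capacity_) counts_.emplace(value, 1);  // admit while space remains
    }

    void onDelete(int value) {
        auto it = counts_.find(value);
        if (it != counts_.end() && it->second > 0) --it->second;
    }

    // Exact count for a tracked value, or -1 so the caller can fall back to the histogram.
    long long countOf(int value) const {
        auto it = counts_.find(value);
        return it == counts_.end() ? -1 : it->second;
    }

private:
    std::size_t capacity_;
    std::unordered_map<int, long long> counts_;
};
```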