This project was created for the Future of Database Programming Contest. We achieved first place 🥇 in the challenge organized by Athena Research Center and the DareLab group.
In this contest the task was to build a real-time cardinality estimator for a table containing two integer columns. Cardinality estimation is at the heart of query optimization in relational databases. Accurate estimations guide the optimizer to choose a good execution plan; an error in estimation can lead to inefficient query plans.
This year's challenge required tackling:
- Frequent data changes.
- Correlation between columns.
- Data skewness.
Besides estimation accuracy, the estimator had to be efficient in terms of memory usage and execution time. It was evaluated based on the following criteria:

- **Data and Operations:** The initial data volume of each test case cannot exceed 50,000,000, including 20,000,000 insert, delete, and query operations. Data values range from 1 to 20,000,000.
- **Memory Usage:** Memory usage is limited to 4 MB.
- **Execution Time:** The estimator must run in under 10 seconds for each test case.
- **Scoring:** The score is based on the estimation error (the lower, the better). Assuming n is the number of queries, the error is calculated as follows (a small code sketch of this computation is shown after the formula):
$$\text{score} = \frac{\sum_{i=1}^{n} \left| \log\left(\frac{\text{EstimatedValue}_i + 1}{\text{RealAnswer}_i + 1}\right) \right|}{n}$$
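In code, the score is just the mean absolute log-ratio between estimated and true counts. A minimal sketch (the function name is illustrative, not part of the contest framework):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute log error over all queries, matching the formula above.
// Both counts are shifted by +1 so that queries with a true count of zero stay well defined.
double computeScore(const std::vector<double>& estimated, const std::vector<double>& real) {
    double total = 0.0;
    for (std::size_t i = 0; i < estimated.size(); ++i)
        total += std::fabs(std::log((estimated[i] + 1.0) / (real[i] + 1.0)));
    return total / estimated.size();
}
```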
More details about the task can be seen on the contest details page.
For the competition, a class named DataExecuterDemo was provided to generate random data while the estimator runs. However, the correct answers were also computed at runtime, which made execution slower. To improve performance, a new class named DataExecuterPrecomputed was created; it reads the data and precomputed answers from files.
To take advantage of this new class, we needed to generate the data and correct-answer files. Data generation is handled by the main file in the project root (main.cpp), and a Makefile in the project root lets you control the size of the generated data. Once the main program is run, it produces two files:
- A data file.
- A file containing the correct answers.
The generation process is quite similar to the one used in the provided DataExecuterDemo class. The main difference is that the correct answers are computed with the OpenMP library to speed up the process, as sketched below. Using this method, a dataset of 5 million rows per column and 2 million queries can be generated in 2 minutes on a 12-thread machine.
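As an illustration of the parallel answer computation, here is a minimal sketch. The `Query` struct and function names are hypothetical and do not match the project's actual generator; the point is only that each query's exact count is an independent scan over the data, so the query loop parallelizes trivially with OpenMP:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical query: one predicate per column, -1 meaning "no constraint on this column".
struct Query {
    int valueA = -1;        // constant compared against column A
    int valueB = -1;        // constant compared against column B
    bool greaterA = false;  // true -> "A > valueA", false -> "A == valueA"
    bool greaterB = false;
};

// Compute the exact answer of every query by scanning the data once per query.
// The data is read-only, so threads need no synchronization.
std::vector<long long> computeAnswers(const std::vector<std::pair<int, int>>& rows,
                                      const std::vector<Query>& queries) {
    std::vector<long long> answers(queries.size(), 0);
    #pragma omp parallel for schedule(dynamic)
    for (long long q = 0; q < static_cast<long long>(queries.size()); ++q) {
        const Query& qu = queries[q];
        long long count = 0;
        for (const auto& [a, b] : rows) {
            bool okA = qu.valueA < 0 || (qu.greaterA ? a > qu.valueA : a == qu.valueA);
            bool okB = qu.valueB < 0 || (qu.greaterB ? b > qu.valueB : b == qu.valueB);
            if (okA && okB) ++count;
        }
        answers[q] = count;
    }
    return answers;
}
```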
To compile the estimator, follow these steps:

- **Create a Build Directory**

  In your terminal, in the CardinalityEstimation folder run `mkdir build && cd build`.

- **Run CMake**

  Generate the build files with `cmake ..`.

- **Build the Project**

  Compile the project with `make`.

- **Run the Estimator**

  The executable is named `main` and it accepts the following arguments: `./main <data_file> <answers_file>`.
The estimator is composed of several components (illustrative sketches of some of them follow this list):

- **Histogram-Based Estimators:** Histograms divide the data values into intervals (buckets) and record the frequency of data in each bucket (HistogramBucketed.hpp). Two key classes make up the histogram implementation:
  - **Buckets:** Having a histogram that stores every possible value individually is not feasible in practice. To address this, a class named `Bucket` groups a range of values and their frequencies. The HistogramBucketed class employs `Bucket` objects to track value frequencies across specified ranges. To enhance estimation accuracy, each `Bucket` is further divided into sub-buckets; this hierarchical structure allows more precise frequency tracking when it is required.
  - **BinaryCache:** The `BinaryCache` class is used for estimation when a query contains a greater-than (>) predicate. Instead of iterating through all the buckets and summing their frequencies, it uses a binary structure in which each lower level represents smaller intervals, precomputed to contain the correct number of elements for efficient aggregation. This reduces the estimation complexity to logarithmic time while keeping memory usage minimal.
- **HyperLogLog (HLL):** The `HLL` class provides a fast estimate of the number of distinct values, based on the HyperLogLog algorithm. The implementation follows the HyperLogLog paper.
- **MCV (Most Common Value) Tracking:** The `MCVTrack` class keeps counts for values that occur frequently. By exploiting the tendency of common values to appear early and frequently, it significantly improves the estimator's accuracy.
- **Third-Party Tools:** Besides the standard library, the project uses:
  - rapidhash for an efficient hash function used in HLL.
  - robin_hood unordered map for improved performance over STL's `unordered_map`.
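The sketches below illustrate simplified versions of the components above; all names, members, and parameters are hypothetical and do not reflect the project's actual interfaces. First, a bucket subdivided into sub-buckets, assuming equal-width ranges and uniformity inside each sub-bucket:

```cpp
#include <cstddef>
#include <vector>

// Simplified bucket covering the value range [lo, hi), with finer sub-bucket counts.
struct Bucket {
    int lo = 0, hi = 1;
    long long total = 0;           // rows whose value falls anywhere in [lo, hi)
    std::vector<long long> sub;    // per-sub-bucket counts for finer resolution

    Bucket(int lo_, int hi_, std::size_t numSub) : lo(lo_), hi(hi_), sub(numSub, 0) {}

    void add(int value, long long delta = 1) {
        total += delta;
        sub[subIndex(value)] += delta;
    }

    // Estimate of rows equal to `value`, spreading the sub-bucket count uniformly
    // over the values that sub-bucket covers.
    double estimateEqual(int value) const {
        double subWidth = static_cast<double>(hi - lo) / sub.size();
        return sub[subIndex(value)] / subWidth;
    }

    std::size_t subIndex(int value) const {
        return static_cast<std::size_t>(value - lo) * sub.size() / (hi - lo);
    }
};
```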
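For the BinaryCache idea, the sketch below uses a Fenwick (binary indexed) tree over bucket counts, which is one standard way to get logarithmic-time prefix aggregation from a single flat array; the project's actual layered structure may differ in layout, but the complexity trade-off it describes is the same:

```cpp
#include <cstddef>
#include <vector>

// Prefix-sum structure over bucket counts: "how many rows fall in buckets 0..k"
// is answered in O(log n), so a ">" predicate needs no full scan of the buckets.
class PrefixCounts {
public:
    explicit PrefixCounts(std::size_t numBuckets) : tree_(numBuckets + 1, 0) {}

    // Add `delta` occurrences to the given 0-based bucket (delta may be negative for deletes).
    void update(std::size_t bucket, long long delta) {
        for (std::size_t i = bucket + 1; i < tree_.size(); i += i & (~i + 1))
            tree_[i] += delta;
    }

    // Total count of buckets 0..bucket (inclusive).
    long long countUpTo(std::size_t bucket) const {
        long long sum = 0;
        for (std::size_t i = bucket + 1; i > 0; i -= i & (~i + 1))
            sum += tree_[i];
        return sum;
    }

    // Count of rows strictly above `bucket`, as needed for a greater-than predicate.
    long long countAbove(std::size_t bucket) const {
        return countUpTo(tree_.size() - 2) - countUpTo(bucket);
    }

private:
    std::vector<long long> tree_;
};
```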
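A stripped-down HyperLogLog follows, omitting the small- and large-range corrections from the paper; it only shows where the registers and the bias constant come from. The caller is expected to pass in a 64-bit hash of the value (the project uses rapidhash for this):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal HyperLogLog with m = 2^p registers. Each register stores the largest
// "rank" (number of leading zero bits + 1) observed among hashes routed to it.
class HyperLogLogSketch {
public:
    explicit HyperLogLogSketch(unsigned p = 12) : p_(p), registers_(1u << p, 0) {}

    void add(std::uint64_t hash) {
        std::uint64_t index = hash >> (64 - p_);    // top p bits choose the register
        std::uint64_t rest  = (hash << p_) | 1ULL;  // guard bit keeps clz well defined
        std::uint8_t rank   = static_cast<std::uint8_t>(__builtin_clzll(rest) + 1);  // GCC/Clang builtin
        if (rank > registers_[index]) registers_[index] = rank;
    }

    // Raw HyperLogLog estimate (no range corrections).
    double estimate() const {
        double m = static_cast<double>(registers_.size());
        double sum = 0.0;
        for (std::uint8_t r : registers_) sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^-r
        double alpha = 0.7213 / (1.0 + 1.079 / m);  // bias constant, valid for m >= 128
        return alpha * m * m / sum;
    }

private:
    unsigned p_;
    std::vector<std::uint8_t> registers_;
};
```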
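Finally, a minimal most-common-value tracker. It keeps exact counts only for values admitted while it still has free slots and ignores later newcomers, relying on the observation that frequent values tend to appear early; the real MCVTrack class may use a different admission policy:

```cpp
#include <cstddef>
#include <unordered_map>

// Bounded per-value counter: exact counts for at most `capacity` distinct values.
class CommonValueCounter {
public:
    explicit CommonValueCounter(std::size_t capacity) : capacity_(capacity) {}

    void onInsert(int value) {
        auto it = counts_.find(value);
        if (it != counts_.end()) { ++it->second; return; }
        if (counts_.size() < capacity_) counts_.emplace(value, 1);  // admit while space remains
    }

    void onDelete(int value) {
        auto it = counts_.find(value);
        if (it != counts_.end() && it->second > 0) --it->second;
    }

    // Exact count for a tracked value, or -1 so the caller can fall back to the histogram.
    long long countOf(int value) const {
        auto it = counts_.find(value);
        return it == counts_.end() ? -1 : it->second;
    }

private:
    std::size_t capacity_;
    std::unordered_map<int, long long> counts_;
};
```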