Bucketed Cuckoo hash table written in PTX and JIT-compiled.
This repository aims to experiment with JIT compilations of hand-written PTX CUDA files (we also love assembly). We implemented insertion and query kernels for a bucketed cuckoo hash set entirely in NVIDIA's PTX.
Cuckoo hashing is a probing scheme that achieves very low number of probes at very high load factors. This implementation is probably the fastest GPU hash set implementation, but if you are interested in the state-of-the-art cuckoo hashing implementation, check out our bucketed cuckoo hashing implementation and paper.
git clone https://github.com/maawad/PTX_BCHT.git
cd PTX_BCHT
mkdir build && cd build
cmake ..
make
Requirement: NVIDIA GPU with Volta or later microarchitecture and CUDA.
Usage
./ptx_cuckoo_hashtable
--num-keys Number of keys
--load-factor Load factor of the hash set
--exist-ratio Ratio of queries that exist in the hash set
--device_id GPU device ID
Example:
# from the build directory:
./ptx_cuckoo_hashtable --num-keys=1'000'000 --load-factor=0.99 --exist-ratio=1.0
Output:
num_keys: 1000000
load_factor: 0.99
device_id: 0
Device[0]: NVIDIA TITAN V
--------------------------------------------------------
CUDA Link Completed in 0.000000 ms.
Linker Output:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'bcht_insert' for 'sm_70'
ptxas info : Function properties for bcht_insert
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 412 bytes cmem[0]
info : 0 bytes gmem
info : Function properties for 'bcht_insert':
info : used 28 registers, 0 stack, 0 bytes smem, 412 bytes cmem[0], 0 bytes lmem
--------------------------------------------------------
CUDA kernel: bcht_insert launched
--------------------------------------------------------
CUDA Link Completed in 0.000000 ms.
Linker Output:
ptxas info : 98 bytes gmem
ptxas info : Compiling entry function 'bcht_find' for 'sm_70'
ptxas info : Function properties for bcht_find
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 424 bytes cmem[0]
info : 98 bytes gmem
info : Function properties for 'bcht_find':
info : used 28 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem
--------------------------------------------------------
CUDA kernel: bcht_find launched
Success!
Find rate: 3160.4 million keys/s
Insert rate: 1650.99 million keys/s
Find ratio was: 100%
Below are results for building a hash table with a given number of keys then performing the same number of queries as the number of keys. Queries are performed for different positive queries ratios (e.g., 100% for queries that all exist in the hash set).
Millions | % | Million keys/s
Number of keys | Load factor | Insertion rate | Find rate
| | | 0% 50% 100%
50 | 60 | 1461.81 | 3708.09 3800.74 3897.83
50 | 65 | 1462.44 | 3663.85 3779.85 3889.45
50 | 70 | 1462.18 | 3791.30 3736.75 3888.21
50 | 75 | 1458.97 | 3487.22 3660.28 3991.88
50 | 80 | 1451.99 | 3258.47 3538.27 3864.82
50 | 82 | 1446.38 | 3302.10 3454.41 3965.90
50 | 84 | 1440.71 | 3017.43 3378.41 3822.16
50 | 86 | 1431.60 | 2855.28 3262.82 3808.15
50 | 88 | 1419.21 | 2684.79 3264.78 3766.44
50 | 90 | 1404.71 | 2593.79 2987.34 3727.34
50 | 91 | 1395.24 | 2400.24 2917.55 3712.32
50 | 92 | 1383.33 | 2381.16 2822.76 3684.59
50 | 93 | 1371.44 | 2190.39 2740.23 3652.07
50 | 94 | 1356.04 | 2153.77 2724.78 3724.78
50 | 95 | 1337.41 | 1969.59 2601.53 3668.25
50 | 96 | 1314.53 | 1850.32 2431.92 3621.73
50 | 97 | 1285.49 | 1735.31 2316.00 3537.24
50 | 98 | 1243.96 | 1622.30 2188.13 3340.51
50 | 99 | 1178.17 | 1499.96 2045.16 3178.91
If you have QT 6.3 or higher, you can also build a text editor where you can load the PTX kernels and edit them with an option to hot-reload and recompile the kernels with each edit.