Voxel rendering experiments
- PlainDDA: Incremental DDA over flat grid
- MultiDDA: Incremental DDA over 2-level grid (8³ bricks)
- eXtendedBrickMap: Space skipping over 3-level grid + 4³ occupancy bitmasks (4³ sectors -> 8³ bricks)
- ManhattanDF/EuclideanDF: Space skipping over tiled 256³ distance fields at 1:4 resolution + 4³ occupancy bitmasks
- OctantDF: Space skipping over tiled 256³ 8-directional distance fields at 1:4 resolution + 4³ occupancy bitmasks
- ESVO: Port of "Efficient Sparse Voxel Octrees", no contours, no beam optimization
- Tree64: Sparse voxel 4³-tree
- BrickBVH: Binary BVH as TLAS + DDA over 8³ brick leaves (software impl)
- GreedyMesh: Rasterized greedy mesh
Maybe TODO:
- HybridBVH: BVH2 or CWBVH as TLAS + XBrickMap (32³ sectors) or Tree64 leaves + partial splits
Method | Accel method | Mem cost (impl) | Edit cost | Advantages | Drawbacks |
---|---|---|---|---|---|
PlainDDA | none | 1 bit per 1³ | O(1) | good start point | not fast enough |
MultiDDA | space part | 1 bit per 1³ + 1 bit per 8³ | O(1) | ? | simd divergence |
XBrickMap | space part | 8³ bits per brick + 12 bytes per sector | ~O(1), *1 | ? | ? |
ManhattanDF | ray march | 1 byte per 4³ + 1 bit per 1³ | O(m) | ? | ? |
EuclideanDF | ray march | 2 bytes per 4³ + 1 bit per 1³ | O(m) | ? | ? |
OctantDF | ray march | 8 bytes per 4³ + 1 bit per 1³ | O(m) | ? | ? |
ESVO | space part | ~4 bytes per node | ~O(log m), *2 | contours? | hard to edit |
Tree64 | space part | 12 bytes per node | ~O(log m), *2 | lower overhead vs octrees | hard to edit |
BrickBVH | geom part | 16 bytes per node + 8³ bits per brick | varies | flexible, hw accel, well researched | ? |
Meshing | raster | 8 bytes per quad | O(m) | flexible | hard to scale, only primary rendering, indirect |
- *1: XBrickMap supports arbitrary edits within a brick, but insertions and deletions may require reallocations at sector level to make space for new bricks.
- *2: Tree edits may require reallocations up to the root node to make space for new children nodes. This is not currently implemented.
TL;DR:
Throughput ratio for primary rays vs +1 bounce diffuse:
Labels:
- Mrays/s: million ray casts per second
- Mrays/s PT1: total Mrays over 1 primary + 1 diffuse bounce
- Iters/ray: average number of traversal iterations per ray
- Clocks/iter: average GPU clocks per traversal iteration (based on `subgroupclockARB()`)
- Sync: upload and initialization cost of GPU buffers (per scene, not super accurate; voxel materials are only implemented in XBrickMap and Tree64)
All tests were run on an integrated GPU. I guesstimate at least 5-10x throughput on real hardware.
Scene: Sponza 2k (499k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 31.4 | 3.2 (0.10x) | 391.7 | 67.8 | 133.2 ms | 44.3 ms |
MultiDDA | 147.7 | 64.6 (0.44x) | 58.1 | 79.1 | 148.1 ms | 37.4 ms |
XBrickMap | 165.6 | 125.1 (0.76x) | 19.2 | 186.9 | 171.8 ms | 150.5 ms |
ESVO | 95.0 | 66.2 (0.70x) | 46.9 | 148.3 | 0.0 ms | 652.7 ms |
Tree64 | 182.6 | 124.2 (0.68x) | 19.9 | 170.7 | 0.0 ms | 1318.2 ms |
BrickBVH | 175.0 | 102.5 (0.59x) | 27.3 | 123.6 | 0.0 ms | 490.3 ms |
ManhattanDF | 166.3 | 93.2 (0.56x) | 34.3 | 106.7 | 1003.1 ms | 608.3 ms |
EuclideanDF | 174.3 | 30.0 (0.17x) | 39.7 | 97.2 | 1476.6 ms | 755.0 ms |
OctantDF | 129.0 | 84.9 (0.66x) | 31.1 | 168.5 | 475.9 ms | 656.5 ms |
Scene: Eco House 2k (448k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 26.4 | 3.8 (0.14x) | 434.7 | 69.7 | 128.5 ms | 35.6 ms |
MultiDDA | 71.4 | 34.6 (0.48x) | 112.6 | 89.0 | 148.6 ms | 33.2 ms |
XBrickMap | 96.3 | 76.5 (0.79x) | 27.8 | 239.1 | 117.7 ms | 129.6 ms |
ESVO | 67.1 | 52.7 (0.79x) | 59.9 | 170.8 | 0.0 ms | 651.2 ms |
Tree64 | 107.5 | 93.9 (0.87x) | 26.8 | 231.1 | 0.0 ms | 1371.3 ms |
BrickBVH | 91.5 | 63.4 (0.69x) | 33.2 | 212.6 | 0.0 ms | 510.4 ms |
ManhattanDF | 132.0 | 55.8 (0.42x) | 34.3 | 149.5 | 638.3 ms | 379.0 ms |
EuclideanDF | 84.5 | 12.5 (0.15x) | 45.2 | 196.0 | 1041.3 ms | 446.6 ms |
OctantDF | 99.3 | 69.0 (0.69x) | 28.4 | 225.2 | 316.0 ms | 423.8 ms |
Scene: Bistro 4k (975k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 27.7 | 4.4 (0.16x) | 436.4 | 67.9 | 164.2 ms | 122.1 ms |
MultiDDA | 79.3 | 41.3 (0.52x) | 107.4 | 82.3 | 151.3 ms | 87.0 ms |
XBrickMap | 109.6 | 87.0 (0.79x) | 27.9 | 205.6 | 216.9 ms | 191.9 ms |
ESVO | 67.9 | 52.5 (0.77x) | 71.1 | 141.3 | 0.0 ms | 1308.4 ms |
Tree64 | 109.7 | 92.6 (0.84x) | 30.9 | 193.3 | 0.0 ms | 2011.6 ms |
BrickBVH | 88.3 | 63.0 (0.71x) | 50.4 | 146.0 | 0.0 ms | 1109.2 ms |
ManhattanDF | 109.8 | 47.4 (0.43x) | 53.8 | 115.2 | 2119.2 ms | 1331.7 ms |
EuclideanDF | 101.1 | 28.5 (0.28x) | 66.6 | 105.4 | 2897.5 ms | 1235.0 ms |
OctantDF | 87.8 | 53.0 (0.60x) | 41.4 | 192.0 | 1101.6 ms | 1486.3 ms |
Scene: San Miguel 4k (681k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 33.1 | 2.6 (0.08x) | 354.3 | 69.0 | 141.9 ms | 59.9 ms |
MultiDDA | 97.2 | 48.0 (0.49x) | 66.2 | 106.3 | 158.3 ms | 53.7 ms |
XBrickMap | 109.8 | 95.7 (0.87x) | 25.2 | 226.0 | 168.6 ms | 152.4 ms |
ESVO | 67.9 | 54.5 (0.80x) | 68.4 | 147.1 | 0.0 ms | 818.1 ms |
Tree64 | 112.1 | 99.5 (0.89x) | 28.8 | 205.7 | 0.7 ms | 1453.9 ms |
BrickBVH | 88.0 | 62.9 (0.71x) | 42.7 | 171.3 | 0.0 ms | 717.8 ms |
ManhattanDF | 105.0 | 60.1 (0.57x) | 48.3 | 132.3 | 819.2 ms | 774.2 ms |
EuclideanDF | 99.0 | 16.9 (0.17x) | 60.9 | 115.6 | 1905.2 ms | 661.4 ms |
OctantDF | 82.2 | 65.8 (0.80x) | 37.7 | 217.4 | 593.9 ms | 753.2 ms |
Scene: Forest Lake 2k (807k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 23.6 | 23.2 (0.99x) | 510.4 | 67.5 | 154.7 ms | 74.4 ms |
MultiDDA | 60.6 | 31.4 (0.52x) | 119.7 | 101.1 | 134.8 ms | 110.5 ms |
XBrickMap | 88.8 | 71.6 (0.81x) | 26.2 | 278.1 | 170.4 ms | 218.7 ms |
ESVO | 57.4 | 43.5 (0.76x) | 63.3 | 189.3 | 0.0 ms | 1202.1 ms |
Tree64 | 93.1 | 83.3 (0.89x) | 27.4 | 262.6 | 0.0 ms | 1692.5 ms |
BrickBVH | 68.7 | 46.6 (0.68x) | 41.5 | 238.9 | 0.0 ms | 957.4 ms |
ManhattanDF | 88.4 | 50.7 (0.57x) | 43.3 | 179.4 | 1131.9 ms | 767.6 ms |
EuclideanDF | 80.2 | 10.1 (0.13x) | 56.7 | 158.4 | 1997.2 ms | 692.3 ms |
OctantDF | 71.0 | 48.6 (0.69x) | 35.0 | 282.8 | 605.5 ms | 786.3 ms |
Scene: NY Highway I95 4k (574k * 8³ voxels)
Method | Mrays/s | Mrays/s PT1 | Iters/ray | Clocks/iter | GPU sync | CPU sync |
---|---|---|---|---|---|---|
PlainDDA | 24.0 | 23.8 (0.99x) | 512.0 | 67.4 | 141.2 ms | 43.0 ms |
MultiDDA | 42.5 | 34.1 (0.80x) | 312.1 | 59.3 | 124.1 ms | 50.0 ms |
XBrickMap | 115.2 | 88.5 (0.77x) | 31.3 | 181.9 | 136.1 ms | 144.7 ms |
ESVO | 136.9 | 86.4 (0.63x) | 33.2 | 139.0 | 0.0 ms | 719.0 ms |
Tree64 | 208.9 | 137.4 (0.66x) | 16.0 | 186.8 | 0.0 ms | 1433.0 ms |
BrickBVH | 215.9 | 154.0 (0.71x) | 14.2 | 174.1 | 0.0 ms | 571.7 ms |
ManhattanDF | 100.3 | 53.0 (0.53x) | 26.2 | 250.2 | 916.4 ms | 943.8 ms |
EuclideanDF | 27.4 | 11.1 (0.41x) | 29.0 | 870.8 | 2477.6 ms | 814.6 ms |
OctantDF | 131.2 | 87.8 (0.67x) | 24.9 | 198.1 | 723.5 ms | 915.5 ms |
Scene refs:
MultiDDA is a simple brickmap implementation that uses two nested DDA traversal loops to perform space skipping, one at 8³ scale and the other at voxel scale. Despite its simplicity, and even though skips are limited to a single scale, it performs surprisingly well relative to the other techniques when considering only primary rays.
The poor performance with incoherent rays is caused by control-flow divergence when entering the inner traversal loop: not all SIMD lanes hit a brick on the same iteration, so the lanes that miss are "stalled" for the duration of the inner loop. Other methods are not immune to incoherent rays either, due to poor memory access patterns, but divergence has a particularly large impact in this case.
I tried to mitigate the divergence in two ways, following ideas from the BVH traversal literature, without much success. The first attempt used wave intrinsics to postpone the inner traversal until enough lanes were active; the second un-nested and paired the DDA loops, so that the inner DDA runs only after the brick DDA finds a hit and control flow has re-converged.
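Below is a minimal, hedged C++ sketch of the nested-DDA idea (the real traversal is a GPU shader; GLM is assumed here for vector math, and the occupancy tests are passed in as callables, so none of the names reflect the actual code). It also assumes no zero direction components; the real code needs the usual epsilon workarounds.

```cpp
// Hedged CPU-side sketch of MultiDDA, not the actual shader code.
#include <glm/glm.hpp>
#include <functional>

using glm::vec3; using glm::ivec3;

// One DDA level: visit integer cells of size `cellSize` along the ray until
// `hit(cell)` returns true or `maxIters` runs out.
static bool dda(vec3 origin, vec3 dir, float cellSize, int maxIters,
                const std::function<bool(ivec3)>& hit) {
    ivec3 cell = ivec3(glm::floor(origin / cellSize));
    ivec3 step = ivec3(glm::sign(dir));
    vec3 invDir = 1.0f / dir;
    vec3 delta = glm::abs(cellSize * invDir);
    // Ray parameter at which the next cell boundary is crossed, per axis.
    vec3 next = ((vec3(cell) + glm::max(vec3(step), vec3(0.0f))) * cellSize - origin) * invDir;

    for (int i = 0; i < maxIters; i++) {
        if (hit(cell)) return true;
        int axis = next.x < next.y ? (next.x < next.z ? 0 : 2)
                                   : (next.y < next.z ? 1 : 2);
        cell[axis] += step[axis];
        next[axis] += delta[axis];
    }
    return false;
}

// Outer DDA walks 8³ bricks; whenever a brick is occupied, an inner voxel DDA
// runs inside it. On the GPU that inner loop is where SIMD lanes diverge,
// since lanes whose brick test failed idle while the others traverse.
static bool traceMultiDDA(vec3 origin, vec3 dir,
                          const std::function<bool(ivec3)>& brickOccupied,   // 1 bit per 8³
                          const std::function<bool(ivec3)>& voxelOccupied) { // 1 bit per 1³
    return dda(origin, dir, 8.0f, 512, [&](ivec3 brick) {
        if (!brickOccupied(brick)) return false;

        // Advance to the ray's entry point into this brick (slab test), nudged
        // slightly forward to avoid landing exactly on the boundary.
        vec3 bmin = vec3(brick) * 8.0f;
        vec3 t0 = (bmin - origin) / dir, t1 = (bmin + 8.0f - origin) / dir;
        vec3 tmin = glm::min(t0, t1);
        float tEnter = glm::max(glm::max(tmin.x, tmin.y), glm::max(tmin.z, 0.0f));
        vec3 p = origin + dir * (tEnter + 1e-4f);

        // Inner voxel-scale DDA, bounded to this brick.
        bool found = false;
        dda(p, dir, 1.0f, 24, [&](ivec3 voxel) {
            if (glm::any(glm::notEqual(voxel >> 3, brick))) return true; // exited the brick
            found = voxelOccupied(voxel);
            return found;
        });
        return found;
    });
}
```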
XBrickMap attempts to improve on the idea of skipping through multiple grid levels by taking advantage of tiled bitmasks to perform more general hierarchical skips, using bitwise tests to identify empty sub-sections of voxels (4³, 2³, or 1³ cuboids). The bitmasks are also used to omit empty bricks within each sector (left-packing) for additional memory savings; random access is made possible by using popcnt instructions to count the number of preceding entries at any given index.
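As a rough illustration of the left-packed lookup (not the actual struct layout): a 64-bit per-sector brick mask plus a 32-bit base index would be consistent with the 12 bytes per sector quoted in the table, and a popcount over the lower bits yields a brick's position within the packed pool.

```cpp
// Hedged sketch of popcount-based "left-packed" brick lookup; field names and
// layout are illustrative, assuming one bit per brick of a 4³ sector.
#include <bit>      // std::popcount (C++20)
#include <cstdint>

struct Sector {
    uint64_t brickMask;   // bit i set => brick i within this sector is allocated
    uint32_t brickBase;   // index of this sector's first brick in the brick pool
};

// Returns the index of brick `i` (0..63) in the brick pool, or -1 if empty.
inline int64_t brickSlot(const Sector& s, uint32_t i) {
    if (!(s.brickMask >> i & 1)) return -1;                       // brick not present
    uint64_t preceding = s.brickMask & ((uint64_t(1) << i) - 1);  // bits below i
    return s.brickBase + std::popcount(preceding);                // skip packed entries
}
```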
Traversal was implemented using an algorithm similar to ray marching. At each iteration, voxel positions are computed by intersecting with the three back-facing cube planes, then the brickmap layers are queried in order to determine the step size. This is simpler and more efficient than DDA variants for stepping by arbitrary amounts, but requires workarounds such as biasing the intersection distances to prevent infinite backtracking due to floating-point limitations.
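A small sketch of that stepping rule follows (GLM assumed; the bitmask/sector queries that produce `cellSize` are omitted, and the bias constant is illustrative).

```cpp
// Advance the ray to the exit of the empty axis-aligned cell containing the
// current point. `cellSize` is whatever extent (1, 2, 4, 8 voxels) the
// hierarchy reports as empty around that point.
#include <glm/glm.hpp>
using glm::vec3;

float stepToCellExit(vec3 origin, vec3 dir, vec3 invDir, float t, float cellSize) {
    vec3 p = origin + dir * t;
    // Bounds of the empty cell containing p, snapped to a multiple of cellSize.
    vec3 cellMin = glm::floor(p / cellSize) * cellSize;
    // Back-facing plane per axis: the max face for positive directions, the min
    // face for negative ones.
    vec3 exitPlane = cellMin + glm::max(glm::sign(dir), vec3(0.0f)) * cellSize;
    vec3 tExit = (exitPlane - origin) * invDir;
    float tNext = glm::min(tExit.x, glm::min(tExit.y, tExit.z));
    // Bias slightly past the boundary; without this, rounding can make the next
    // sample land back in the same cell and the march never advances.
    return glm::max(tNext, t) + 1e-4f * cellSize;
}
```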
Ray marching through distance fields is another very simple and intuitive traversal method, but the naive approach of storing one distance value per voxel is impractical for bigger grids due to the associated memory and especially bandwidth costs, the latter of which becomes problematic even for primary rays as they lose coherence the farther they get from the origin. Moreover, traversal is not considerably shorter than that of the space-partitioning methods (~8%), and pre-processing has a significant cost (10x more GPU time than XBrickMap).
Memory costs can be reduced by lowering the field resolution, without much detriment to traversal efficiency. Following the theme of 64-bit masks, I picked 1:4 scale (one distance value per 64 voxels) combined with occupancy masks for the final implementation. The masks are queried whenever sampled distance values are zero.
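A hedged sketch of one march iteration over the 1:4 field (GLM assumed; `distanceAt`/`occupancyAt` are illustrative accessors for the one-byte distance and 64-bit mask per 4³ cell, not the real layout; exact distance conventions depend on how the field is built):

```cpp
#include <glm/glm.hpp>
#include <cstdint>
using glm::vec3; using glm::ivec3;

// Returns true on a voxel hit; otherwise advances t and returns false.
template <typename DistFn, typename OccFn>
bool marchStep(vec3 origin, vec3 dir, float& t,
               DistFn distanceAt, OccFn occupancyAt, ivec3& hitVoxel) {
    vec3 p = origin + dir * t;
    ivec3 coarse = ivec3(glm::floor(p / 4.0f));
    uint32_t dist = distanceAt(coarse);          // distance in units of 4³ cells
    if (dist == 0) {
        // The coarse cell may contain voxels: consult its 64-bit occupancy mask.
        ivec3 voxel = ivec3(glm::floor(p));
        ivec3 local = voxel & 3;                 // voxel position within the 4³ cell
        uint32_t bit = local.x + local.y * 4 + local.z * 16;
        if (occupancyAt(coarse) >> bit & 1) { hitVoxel = voxel; return true; }
        t += 0.5f;  // crude sub-cell advance; the real code steps by exact plane intersections
    } else {
        // For a Manhattan-metric field, the safe parametric skip is the stored
        // distance divided by the L1 norm of the (L2-normalized) direction.
        t += (float(dist) * 4.0f) /
             (glm::abs(dir.x) + glm::abs(dir.y) + glm::abs(dir.z));
    }
    return false;
}
```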
Dense areas with fine geometry (like foliage) are worst cases for ray traversal because they cause grazing rays to advance slowly, motivating the idea of "directional" or "anisotropic" distance fields. OctantDF implements this idea by creating 8 separate Manhattan fields, each considering only voxels ahead of the associated ray octant direction (in other words, distances are only propagated along rows in the axis directions opposite to the octant). Compared to traditional distance fields, it lowers the overall number of iterations by ~20-25% and increases throughput for incoherent rays by ~30-50% (why though?), but ultimately it is still much worse than the space-partitioning methods.
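The per-ray field selection is straightforward; a minimal sketch (illustrative names, GLM assumed):

```cpp
// Pick one of the 8 directional fields from the sign of each direction
// component; that field only needs to account for voxels lying "ahead" of
// rays in this octant.
#include <glm/glm.hpp>
using glm::vec3;

inline int octantIndex(vec3 dir) {
    return (dir.x < 0.0f ? 1 : 0)
         | (dir.y < 0.0f ? 2 : 0)
         | (dir.z < 0.0f ? 4 : 0);
}
// Traversal then samples something like distanceAt(octantIndex(dir), coarseCell)
// instead of a single isotropic field; stepping works as in the Manhattan case.
```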
I have not closely inspected the performance characteristics of ESVO, but I believe the main reason for its underwhelming performance is that it takes many extra iterations to ascend and descend the tree and to advance positions by DDA. Given that it was primarily designed to encode smooth surfaces, other approaches to octree traversal [C. Crassin et al, 2009], [V. Havran, 1999] could offer better performance.
Despite this, the throughput ratio for incoherent rays is comparable to the other methods, possibly thanks to low memory bandwidth usage, given that nodes are very small (4 bytes each) and localized (building the tree in depth-first order results in a z-curve-like layout).
The traversal is a near 1:1 port of the original ESVO implementation; only the node encoding was changed: relative pointers point "backwards" so that trees can be built without intermediate buffers (i.e. parent nodes appear after their children), and far pointers are signaled by a reserved upper range rather than a dedicated bit.
The two main advantages wide 4³ trees have over standard octrees are shallower trees overall and lower overhead per voxel. Based on rudimentary tests, most scenes consume about half as much memory as with ESVO.
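For illustration only, a 12-byte node could be a 64-bit occupancy mask over the 4×4×4 children plus a 32-bit index to the first present child, with children packed contiguously so that a popcount gives any child's offset (the same trick as the XBrickMap sectors); the actual Tree64 layout and its leaf handling may well differ.

```cpp
// Hedged guess at a 12-byte 4³-tree node; padded to 16 bytes as a C++ struct,
// but it would be stored as three 32-bit words in a GPU buffer. How leaves and
// payloads are distinguished is not covered here.
#include <bit>
#include <cstdint>

struct Node64 {
    uint64_t childMask;   // bit i set => child cell i (of 4x4x4 = 64) is present
    uint32_t firstChild;  // index of the first present child in the node pool
};

// Index of child `cell` (0..63) in the node pool; only valid if its bit is set.
inline uint32_t childIndex(const Node64& n, uint32_t cell) {
    uint64_t below = n.childMask & ((uint64_t(1) << cell) - 1);
    return n.firstChild + uint32_t(std::popcount(below));
}
```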
Traversal was at first implemented as a top-down recursive DDA over the bitmask of each node, which performed around 11-15% faster than ESVO for primary rays.
As in XBrickMap, I found that a ray-marching algorithm was both simpler and faster, since it benefits from the bitmasks for longer ray steps and is less prone to divergence. Initially it performed very close to XBrickMap, but did not surpass it until the two following optimizations (which combined added up to an 18% improvement).
Using group-shared memory for the ancestor node stack gave a ~9% speedup compared to a "local" array variable (this trick had a surprisingly negative impact when applied to ESVO, making it 120% slower, presumably due to its 4x bigger stack).
Mirroring the coordinate system to the negative ray octant gave another 11% bump thanks to simplified intersection calculations (no need to offset face planes, simpler node-bounds clamping).
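A hedged sketch of what that setup could look like (GLM assumed, names illustrative): the ray is flipped so every direction component is negative, which makes the exit plane of any cell its min corner and removes the per-axis sign selection; lookups are un-mirrored before indexing.

```cpp
#include <glm/glm.hpp>
using glm::vec3; using glm::ivec3;

struct MirroredRay {
    vec3 origin, dir;
    ivec3 flipped;   // 1 on axes that were mirrored, 0 otherwise
};

// Reflect the ray across the volume center on every axis with a positive
// direction component; `gridSize` is the world-space extent of the volume.
inline MirroredRay mirrorToNegativeOctant(vec3 origin, vec3 dir, float gridSize) {
    MirroredRay r{origin, dir, ivec3(0)};
    for (int a = 0; a < 3; a++) {
        if (dir[a] > 0.0f) {
            r.dir[a] = -dir[a];
            r.origin[a] = gridSize - origin[a];
            r.flipped[a] = 1;
        }
    }
    return r;
}

// Un-mirror a voxel coordinate before indexing the tree/grid.
inline ivec3 unmirror(ivec3 voxel, ivec3 flipped, int gridRes) {
    return voxel * (1 - 2 * flipped) + flipped * (gridRes - 1);
}
```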
Apart from the increased memory efficiency due to sparseness, the most interesting advantage trees have over grids might be support for arbitrary voxel scales and subdivisions, i.e. dynamic detail. Besides LOD streaming, this could also be useful for things like instancing and animation, which could be implemented by simply redirecting leaves to different subtrees.
BVHs have several advantages over all the other methods: they are the industry standard, have been extensively researched, and are quickly gaining hardware-acceleration support. They are also not limited to voxels, and support traditional triangle-based geometry, transformations, and reasonably efficient updates.
TODO: expand
Rasterization without LODs is vastly inefficient with small voxels due to the huge number of primitives generated. Greedy meshing does not seem to provide significant improvements (~15%) unless the number of materials is very limited or decoupled from the mesh, and the voxels form long runs of columns or flat planes (~30% up to 20x). Even then, Sponza at 1k³ resolution, meshed in 62³ chunks with a single material, is nearly in the order of a million quads.
TODO: provide actual numbers; the ad hoc impl does not integrate with the benchmark runner since it has nothing to do with ray tracing.
- "Efficient Sparse Voxel Octrees" - Samuli Laine, Tero Karras
- "A Survey on Bounding Volume Hierarchies for Ray Tracing" - Daniel Meister, Shinji Ogaki, Carsten Benthin, Michael J. Doyle, Michael Guthe, Jiří Bittner
- "Real-time Ray Tracing and Editing of Large Voxel Scenes" - Thijs van Wingerden
- "A high performance realtime CUDA voxel [brick map] path tracer" - stijnherfst
- "Binary Greedy Meshing" - cgerikj
Early screenshots of the CPU (bottom) and GPU (top) renderers, with 2-bounce diffuse lighting.
Build requirements: CMake, vcpkg (set the VCPKG_ROOT env var), and Clang. (The clang-cl shipped with Visual Studio should also work.)
Run requirements: Vulkan 1.3 and AVX2. Probably also luck for my dicey Vulkan and Slang code.
TODO