GitHub - matt77hias/GPURayTraversal: Understanding the Efficiency of Ray Traversal on GPUs

This repository was archived by the owner on Jan 3, 2020. It is now read-only.

Fast GPU Ray Traversal 1.4
--------------------------
Implementation by Tero Karras, Timo Aila, and Samuli Laine
Copyright 2009-2012 NVIDIA Corporation

This package contains full source code for the fast GPU-based ray traversal
routines used in the following paper:

"Understanding the Efficiency of Ray Traversal on GPUs",
Timo Aila and Samuli Laine,
Proc. High-Performance Graphics 2009
http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf

In addition to the original kernels that were optimized for NVIDIA GTX 285,
the package also includes kernels specifically hand-tuned for GTX 480
(GF100/Fermi) and GTX 680 (GK104/Kepler). The results for these GPUs have
been published in the following technical report:

"Understanding the Efficiency of Ray Traversal on GPUs - Kepler and Fermi Addendum",
Timo Aila, Samuli Laine, and Tero Karras,
NVIDIA Technical Report NVR-2012-02,
http://research.nvidia.com/publication/understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum

The accompanying benchmark application and test scenes aim to replicate the
published results as accurately as possible, although there are slight
differences in the test setup (e.g. BVH builder, CUDA version).
See results.txt for details.

The source code is licensed under the New BSD License (see LICENSE) and
hosted on Google Code:

http://code.google.com/p/understanding-the-efficiency-of-ray-traversal-on-gpus/

System requirements
-------------------

- Microsoft Windows XP, Vista, or 7.

- At least 1 GB of system memory.

- NVIDIA CUDA-compatible GPU with compute capability 1.2 or higher and at
least 1.5 GB of GPU memory. GeForce GTX 480 or GTX 680 is recommended.

- Microsoft Visual Studio 2010. Required even if you do not plan to build
the source code, as the runtime CUDA compilation mechanism depends on it.

Instructions
------------

1. Install Visual Studio 2010. The Express edition can be downloaded from:
http://www.microsoft.com/visualstudio/en-us/products/2010-editions/visual-cpp-express

2. Install the latest NVIDIA GPU drivers and CUDA Toolkit.
http://developer.nvidia.com/object/cuda_archive.html

3. Run rt.exe to start the application in interactive mode. The first
run executes certain initialization tasks that may take a while to
complete.

4. If you get an error during initialization, the most probable explanation
is that the application is unable to launch nvcc.exe contained in the
CUDA Toolkit. In this case, you should:

- Set CUDA_BIN_PATH to point to the CUDA Toolkit "bin" directory, e.g.
"set CUDA_BIN_PATH=C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin".

- Set CUDA_INC_PATH to point to the CUDA Toolkit "include" directory, e.g.
"set CUDA_INC_PATH=C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include".

- Run vcvars32.bat to set up Visual Studio paths, e.g.
"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\vcvars32.bat".

5. Run benchmark.cmd to measure the performance of all test scenes with
the same settings that were used in the paper. Expected performance
numbers for different GPUs are listed in "results.txt". Note that a
64-bit build is required to benchmark the San Miguel scene.

6. Optional: Build the application yourself.

- Open rt.sln in Visual Studio 2010.
- Right-click the "rt" project and select "Set as StartUp Project".
- Select the Release build. The Debug build is very slow, especially when
sorting secondary rays during the benchmark.
- Build and run.

Test setup
----------

Camera positions:

- Results of each scene are averaged over 5 distinct camera positions.
- Camera positions are specified on the command line using
"signature strings".
- To generate a signature string, click "Show camera controls" in the
interactive mode and then "Export camera signature..."

Ray generation:

- Viewport of 1024x768 pixels.
- Primary rays that hit scene geometry generate 32 secondary rays (AO or
diffuse) distributed according to a cosine-weighted Halton sequence on
the hemisphere.
- Primary rays that miss the geometry generate dummy secondary rays that
are ignored by the ray traversal kernel and excluded from the
rays-per-second figures.
- AO rays are of limited length and terminate immediately after encountering
an intersection.
- Diffuse interreflection rays are very long and continue the traversal
until they find the closest intersection.
- See src/rt/ray/RayGen.cpp
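
The secondary-ray distribution described above can be illustrated with a
minimal C++ sketch. It is not taken from RayGen.cpp: the Halton bases (2 and
3), the disk-projection mapping, and the names radicalInverse and
cosineHaltonDirection are assumptions made for illustration. Directions are
returned in a local frame whose z axis is the surface normal.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Radical inverse of index i in the given base: the digits of i are
// mirrored around the radix point. This is the building block of Halton.
inline float radicalInverse(unsigned i, unsigned base) {
    assert(base >= 2);
    const float inv = 1.0f / base;
    float f = inv;   // weight of the current digit
    float r = 0.0f;  // accumulated result in [0, 1)
    while (i > 0) {
        r += f * (i % base);
        i /= base;
        f *= inv;
    }
    return r;
}

// i-th cosine-weighted direction on the hemisphere around +z.
// Assumption: Halton bases 2 and 3; a point is sampled on the unit disk
// and projected up onto the hemisphere, which yields a cosine-weighted
// density.
inline Vec3 cosineHaltonDirection(unsigned i) {
    const float kTwoPi = 6.28318530717958647692f;
    const float u1 = radicalInverse(i, 2);
    const float u2 = radicalInverse(i, 3);
    const float r = std::sqrt(u1);   // disk radius
    const float phi = kTwoPi * u2;   // disk angle
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u1) };
}
```

Calling cosineHaltonDirection(i) for i = 0..31 gives one set of 32 local
directions; a real ray generator would still have to rotate them into the
tangent frame of each hit point.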

Batches and sorting:

- All primary rays are traced in a single launch.
- Secondary rays are divided into batches of 2^20 rays, and each batch
is traced in a separate launch.
- Primary rays are generated according to 2D Morton order in screen space.
- Each batch of secondary rays is sorted according to 6D Morton order based
on ray origin and direction vectors.
- The sorting is performed on the CPU and is quite a heavy operation,
taking roughly one second per batch. It is disabled in the interactive
mode.
- See src/rt/ray/PixelTable.cpp and src/rt/ray/RayBuffer.cpp
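
The Morton orderings and the batch split mentioned above can be sketched in
C++ as follows. This is an illustration, not the code from PixelTable.cpp or
RayBuffer.cpp: the function names, the 10-bit quantization per component,
the interleaving order, and the use of scene bounds for normalizing origins
are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Interleave the low `bits` bits of `dims` coordinates into one key.
// Bit b of coordinate d lands at position b * dims + d.
inline std::uint64_t mortonCode(const std::uint32_t* coords, int dims, int bits) {
    assert(dims > 0 && bits > 0 && dims * bits <= 64);
    std::uint64_t code = 0;
    for (int b = 0; b < bits; ++b)
        for (int d = 0; d < dims; ++d)
            code |= std::uint64_t((coords[d] >> b) & 1u) << (b * dims + d);
    return code;
}

// 2D Morton key of a pixel (16 bits per axis is ample for 1024x768).
inline std::uint64_t pixelMorton(std::uint32_t x, std::uint32_t y) {
    const std::uint32_t c[2] = { x, y };
    return mortonCode(c, 2, 16);
}

// Map v from [lo, hi] to an integer with `bits` bits, clamping outliers.
inline std::uint32_t quantize(float v, float lo, float hi, int bits) {
    assert(hi > lo && bits > 0 && bits < 32);
    float t = (v - lo) / (hi - lo);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    const std::uint32_t maxValue = (1u << bits) - 1u;
    return static_cast<std::uint32_t>(t * maxValue + 0.5f);
}

// 6D Morton key from ray origin and unit direction.
// Assumption: origins are normalized by the scene bounds [lo, hi] and
// direction components lie in [-1, 1]; 10 bits each gives a 60-bit key.
inline std::uint64_t rayMorton(const float origin[3], const float dir[3],
                               const float lo[3], const float hi[3]) {
    const int bits = 10;
    std::uint32_t c[6];
    for (int k = 0; k < 3; ++k) {
        c[k]     = quantize(origin[k], lo[k], hi[k], bits);
        c[3 + k] = quantize(dir[k], -1.0f, 1.0f, bits);
    }
    return mortonCode(c, 6, bits);
}

// Number of launches needed for `rays` rays in batches of 2^20.
inline std::size_t batchCount(std::size_t rays,
                              std::size_t batchSize = std::size_t(1) << 20) {
    return (rays + batchSize - 1) / batchSize;
}
```

Sorting a batch then amounts to sorting ray indices by their rayMorton key.
If the dummy rays of missed pixels still occupy slots, a 1024x768 viewport
with 32 secondary rays per pixel gives 25,165,824 rays, for which
batchCount returns 24.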

Acceleration structure:

- AABB-based binary bounding volume hierarchy.
- Built using spatial triangle splits (Stich et al.) to improve quality:
http://www.nvidia.com/docs/IO/77714/sbvh.pdf
- Triangles are represented using Woop's affine triangle transformation:
http://www.sven-woop.de/publications/Diplom_SvenWoop_Final.pdf
- Memory layout varies between individual traversal kernels.
- See src/rt/bvh/SplitBVHBuilder.cpp and src/cuda/CudaBVH.cpp
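
The idea behind the affine triangle transformation can be sketched in C++
as shown below. This is one common formulation (map the triangle to a unit
triangle in the z = 0 plane, then intersect in that space), written for
illustration only; the kernels in this package may store and evaluate the
transform differently. The names WoopTriangle, makeWoop, and intersect are
hypothetical.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline Vec3 scale(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Rows of the world-to-unit-triangle matrix, plus the anchor vertex.
struct WoopTriangle { Vec3 row0, row1, row2, origin; };

// Precompute the inverse of the matrix with columns (e1, e2, n), where
// e1 = b - a, e2 = c - a, n = e1 x e2. Its determinant equals |n|^2.
inline WoopTriangle makeWoop(Vec3 a, Vec3 b, Vec3 c) {
    const Vec3 e1 = sub(b, a), e2 = sub(c, a), n = cross(e1, e2);
    const float det = dot(n, n);
    assert(det > 0.0f);  // degenerate triangles are not handled here
    const float inv = 1.0f / det;
    return { scale(cross(e2, n), inv), scale(cross(n, e1), inv), scale(n, inv), a };
}

// Transform the ray into unit-triangle space, intersect with z = 0, and
// test the barycentric coordinates (u, v) against the unit triangle.
inline bool intersect(const WoopTriangle& tri, Vec3 origin, Vec3 dir,
                      float tMin, float tMax, float& tHit, float& u, float& v) {
    const Vec3 p = sub(origin, tri.origin);
    const float oz = dot(tri.row2, p);
    const float dz = dot(tri.row2, dir);
    if (dz == 0.0f) return false;  // ray parallel to the triangle plane
    const float t = -oz / dz;
    if (t < tMin || t > tMax) return false;
    const float uu = dot(tri.row0, p) + t * dot(tri.row0, dir);
    const float vv = dot(tri.row1, p) + t * dot(tri.row1, dir);
    if (uu < 0.0f || vv < 0.0f || uu + vv > 1.0f) return false;
    tHit = t; u = uu; v = vv;
    return true;
}
```

The appeal of this representation is that the per-ray work reduces to a few
dot products against precomputed rows, with no cross products or divisions
by edge lengths at traversal time.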

Ray traversal:

- Performance results are based on the time spent in ray traversal for
the selected ray type. Ray generation and sorting are excluded from
the measurements.
- The code for launching the traversal kernels can be found in
src/cuda/CudaTracer.cpp
- The kernels themselves are located in src/rt/kernels:

fermi_speculative_while_while
Hand-tuned to yield the best performance on GTX 480.
Works on older GPUs as well, but is not optimal.

kepler_dynamic_fetch
Hand-tuned to yield the best performance on GTX 680.
Works on older GPUs as well, but is not optimal.

tesla_persistent_packet
"Persistent packet" kernel from the paper.

tesla_persistent_speculative_while_while
"Persistent speculative while-while" kernel from the paper.
This is the fastest kernel on GTX 285.

tesla_persistent_while_while
"Persistent while-while" kernel from the paper.
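
The way the performance figures are derived (traversal time only, dummy
rays excluded, results averaged over camera positions) can be written down
as a small C++ sketch. The struct and function names are hypothetical, and
taking the arithmetic mean of the per-camera figures is an assumption; this
README does not state how the average is formed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measurement: rays actually traced (dummy rays excluded) and the
// time spent in the traversal kernel only (no ray generation or sorting).
struct TraversalSample {
    std::size_t activeRays;
    double traversalSeconds;
};

// Throughput of a single measurement in millions of rays per second.
inline double mraysPerSecond(const TraversalSample& s) {
    assert(s.traversalSeconds > 0.0);
    return static_cast<double>(s.activeRays) / s.traversalSeconds / 1e6;
}

// Assumption: the per-scene figure is the arithmetic mean of the
// per-camera throughputs.
inline double averageMraysPerSecond(const std::vector<TraversalSample>& cameras) {
    assert(!cameras.empty());
    double sum = 0.0;
    for (const TraversalSample& s : cameras)
        sum += mraysPerSecond(s);
    return sum / static_cast<double>(cameras.size());
}
```

For example, one million active rays traced in half a second correspond to
2 Mrays/s for that camera position.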

Version history
---------------

Version 1.4, May 22, 2012
- Include hand-tuned kernels for Kepler-based GPUs.
- Improve fermi_speculative_while_while perf using vmin/vmax PTX instructions.
- Include San Miguel test scene in the package.
- Improve robustness of the BVH builder with degenerate input.
- Switch to New BSD License (previously Apache License 2.0).
- Upgrade to Visual Studio 2010 (previously 2008).
- Fix a CUDA compilation issue with Visual Studio Express.
- General bugfixes and improvements to framework.

Version 1.3, Jul 08, 2011
- Fix compatibility issues with CUDA 4.0.

Version 1.2, Dec 17, 2010
- Fix issues with nvcc path autodetection with CUDA 3.2.

Version 1.1, Dec 01, 2010
- Update the codebase to support GF104 and CUDA 3.2.
- Speed up ray sorting significantly by utilizing all available CPU cores.
- Minor stability improvements.

Version 1.0, Jun 29, 2010
- Initial release.

Known issues
------------

- When using CUDA 3.2 or later, the performance of device-side code drops
slightly in 64-bit builds. This is because CUDA 3.2 disallows "mixed-bitness
mode", which we utilize on earlier CUDA versions to get maximum performance.
With CUDA 3.2 and later, device code must always be compiled with the same
bitness as host code, which generally results in higher register pressure
in 64-bit builds.

For more information, see:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_3.2_Readiness_Tech_Brief.pdf

- The mesh importer only supports a limited subset of the Wavefront OBJ
file format. If you have trouble importing a mesh, you may want to try
enabling WAVEFRONT_DEBUG in src/framework/io/MeshWavefrontIO.cpp.

Acknowledgements
----------------

Anat Grynberg and Greg Ward for the Conference room model.
University of Utah for the Fairy scene.
Marko Dabrovic (www.rna.hr) for the Sibenik cathedral model.
Guillermo M. Leal Llaguno (www.evvisual.com) for the San Miguel model.