Conversation

@TheBlek commented Nov 15, 2025

Hi @gkjohnson! Here's a simple Lambertian path tracer - both megakernel and wavefront, as discussed in #692. See the webgpu_primitives demo. There is still a lot of work to be done here; I will keep an up-to-date todo list here so you can follow the progress.

At what point do you propose measuring performance?

I don't think it's hard to keep both the megakernel and wavefront path tracers, as they share the essential code (for now, at least), and it could be useful to have a point of reference when implementing additional features. I'm also not sure how their relative performance will scale as features are added.

In WebGPU we can bind at most 128 MiB buffers by default, and there is no way to request bigger limits in three.js. That leaves us with a max tile resolution of 1920x1080 (which I hardcoded in the example for now). I'm wondering whether requesting bigger WebGPU limits - or features, for that matter - is a worthwhile addition to three.js. What do you think?
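
To make the arithmetic concrete (the 64-byte entry size below is an assumption for illustration, not the PR's actual struct layout):

```js
// Hypothetical sizing check: how many queue entries fit in the default
// 128 MiB maxStorageBufferBindingSize limit. The 64-byte entry size is an
// assumed example, not the actual struct layout in this PR.
const MAX_BINDING_BYTES = 128 * 1024 * 1024; // WebGPU default: 134217728
const BYTES_PER_ENTRY = 64;                  // e.g. origin + direction + throughput + padding
const maxEntries = Math.floor( MAX_BINDING_BYTES / BYTES_PER_ENTRY ); // 2097152
console.log( maxEntries >= 1920 * 1080 );    // true: 2073600 entries fit; 2560x1440 would not
```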

This buffer limit also constrains the scene size, at least until a TLAS is deployed and tracing is per-mesh instead of on the whole scene.

Also, by default we can bind a maximum of 8 storage buffers per shader stage, which is a bottleneck for the traceRay kernel. It needs a lot of buffers, so I had to move reading the material index into the next kernel (the BSDF eval phase). This could be fixed if three.js supported array types in structs and top-level storage-buffer-backed structs. At least I could not get that to work, so I had to put the queue sizes into a separate buffer and pass it separately. I'm also wondering whether to open an issue about this limitation in three.js - maybe I just didn't find the right way to do it.
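
For reference, raw WebGPU does allow raising both limits at device creation; the question is whether three.js should expose something like this (a sketch of the underlying API only, not anything three.js offers today):

```js
// Raw WebGPU sketch: requesting larger limits than the defaults.
// three.js does not currently expose this, which is the point of the question.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice( {
	requiredLimits: {
		// clamp requests to what the adapter actually supports
		maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
		maxStorageBuffersPerShaderStage: adapter.limits.maxStorageBuffersPerShaderStage,
	},
} );
```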

What's your policy on the three.js version for the WebGPU release here? Ideally we would use r181, as it includes support for indirect compute (which is essential for wavefront path tracing to avoid reading values back from the GPU between calls). But it was released like 5 days ago :) For now I copy-pasted the compute methods from WebGPURenderer and WebGPUBackend with small tweaks to dispatch indirectly; see the computeIndirect and computeBackendIndirect functions in PathTracerCore.
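
For anyone following along, the point of indirect compute is that the workgroup counts live in a GPU buffer written by a previous kernel, so the CPU never has to read them back. At the raw WebGPU level it looks like this (a sketch; device, encoder, pipeline, and bindGroup come from the usual setup):

```js
// Raw WebGPU sketch of an indirect dispatch. A previous compute pass writes
// [x, 1, 1] into this buffer on the GPU; the CPU never reads it back.
const indirectBuffer = device.createBuffer( {
	size: 3 * 4, // three u32s: workgroup counts in x, y, z
	usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE,
} );

const pass = encoder.beginComputePass();
pass.setPipeline( pipeline );
pass.setBindGroup( 0, bindGroup );
pass.dispatchWorkgroupsIndirect( indirectBuffer, 0 ); // byte offset into the buffer
pass.end();
```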

Just saw that the library recently upgraded to r181 - perfect!

Would love any feedback!

  • Update code to use r181
  • Investigate haloing under spheres in wavefront path tracing
  • Implement tiled rendering to support resolutions greater than 1920x1080
  • Derive performance numbers from three.js's metrics
  • Randomize generated rays for each pixel

For later:

  • Implement physically-based materials (MeshStandardMaterial)
  • PERFORMANCE: before putting items into a storage queue, collect them in a workgroup-local queue for better parallelism (see the sketch after this list)
  • Implement environment map sampling
  • Implement Russian roulette
  • Investigate more WebGL features to be ported
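
A rough sketch of what that workgroup-local queue could look like, assuming a 64-thread workgroup and illustrative names throughout (this is not code from the PR): each thread stages its items in workgroup memory, then a single atomicAdd per workgroup reserves a contiguous range in the global queue.

```js
// Illustrative WGSL sketch only - names, sizes, and layout are assumptions.
// Shown as a raw string because it needs module-scope workgroup variables.
const localQueueWGSL = /* wgsl */ `
var<workgroup> localCount: atomic<u32>;
var<workgroup> localItems: array<u32, 64>;
var<workgroup> globalBase: u32;

// each thread appends its item to the workgroup-local queue
fn pushLocal( item: u32 ) {
	let i = atomicAdd( &localCount, 1u );
	localItems[ i ] = item;
}

// after all pushes: one atomicAdd reserves space, then threads copy in parallel
fn flushLocal(
	globalQueue: ptr<storage, array<u32>, read_write>,
	globalCount: ptr<storage, atomic<u32>, read_write>,
	localIndex: u32,
) {
	workgroupBarrier();
	if ( localIndex == 0u ) {
		globalBase = atomicAdd( globalCount, atomicLoad( &localCount ) );
	}
	workgroupBarrier();
	if ( localIndex < atomicLoad( &localCount ) ) {
		globalQueue[ globalBase + localIndex ] = localItems[ localIndex ];
	}
}
`;
```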

@gkjohnson (Owner) commented

Amazing work! Thanks for putting the time in. In terms of getting an initial version merged, I think these are good goals; they should help get the general architecture defined and answer some of the open questions. Your other TODOs can be added in subsequent PRs, which will make things easier to review:

  • Demo with full resolution support and window resizing.
  • Lambert support only.
  • Support for megakernel & wavefront path tracing to compare performance.
  • Organize the files.

In terms of making this easier to understand - do you mind splitting some of the code out into separate files? The "PathTracerCore" file is pretty monolithic and difficult to follow at the moment. I think these changes will make it easier to give some feedback:

  • I think we can move the compute nodes to a webgpu/nodes folder.
  • Let's remove commented code that's not being used.
  • Can we document a bit more of what's happening especially in the wavefront path tracer step?

And some other comments:

> So that leaves us with max tile resolution of 1920x1080

It would be helpful to know what data is needed at each step of the path tracer. But organizing things from the beginning so that tiled rendering is supported is, I think, our best bet. Once three.js supports requesting larger buffers users can take advantage of that, but we'll still need a graceful fallback either way.
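
As a sketch of the kind of graceful fallback meant here (all names hypothetical), the tile size can be derived from whatever binding limit the device actually reports:

```js
// Hypothetical fallback: choose a tile height that fits the device's actual
// storage-buffer binding limit. All names here are illustrative.
function computeTileSize( width, height, maxBindingBytes, bytesPerPixel ) {
	const maxPixels = Math.floor( maxBindingBytes / bytesPerPixel );
	const rowsPerTile = Math.max( 1, Math.min( height, Math.floor( maxPixels / width ) ) );
	return { tileWidth: width, tileHeight: rowsPerTile, tileCount: Math.ceil( height / rowsPerTile ) };
}

// e.g. 4K with the default 128 MiB limit and an assumed 64-byte ray entry:
computeTileSize( 3840, 2160, 128 * 1024 * 1024, 64 );
// -> { tileWidth: 3840, tileHeight: 546, tileCount: 4 }
```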

> At what point do you propose measuring performance?

Let's get to a point where we have a full resolution scene rendering and we can take a look at the pros / cons of the megakernel and wavefront approaches. The wavefront approach may also wind up showing more of a noticeable improvement later on, when we can terminate rays early and / or sort them and therefore avoid running large blocks of work.

Also just a note that the "megakernel" path doesn't seem to be working for me. I see this error:

```
'TRI_INTERSECT_EPSILON'
    if ( abs( det ) < TRI_INTERSECT_EPSILON ) {
                      ^^^^^^^^^^^^^^^^^^^^^

 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor "compute"]).
```
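
Presumably the constant just isn't being defined in (or injected into) the megakernel's WGSL source. A guessed sketch of the kind of fix, where the exact value is an assumption:

```js
// Guess: prepend the constant to the megakernel's WGSL source. The actual
// epsilon value used by the project may differ.
const constantsWGSL = /* wgsl */ `const TRI_INTERSECT_EPSILON: f32 = 1e-7;`;
```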

@gkjohnson (Owner) left a comment

Thanks for organizing things - it helps quite a bit. I've just taken a deeper look and I think I'm understanding how things roughly work now. I've added a few initial questions, but excuse me if I've asked anything obvious as I wrap my head around it 😅

` );

export const pcgInit = wgslFn( /* wgsl */`
fn pcg_initialize(state: ptr<function, PcgState>, p: vec2u, frame: u32) -> void {
@gkjohnson (Owner)

nit: is there a reason we can't store the "state" variable as a global variable? It seems simpler than passing a state argument in each time.
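
For example, something like this at module scope (a sketch using the PcgState struct from the PR; whether the node system can emit it is the open question):

```js
// Sketch: module-scope private RNG state, so functions no longer need a
// ptr<function, PcgState> argument threaded through every call.
const rngGlobalsWGSL = /* wgsl */ `
var<private> rngState: PcgState;

fn pcg_initialize( p: vec2u, frame: u32 ) {
	// ...same seeding as before, writing to rngState instead of *state...
}
`;
```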

@TheBlek (Author)

Initially I didn't want to work out how to do that in the node system. I'll take a look at whether it's possible.

@gkjohnson (Owner)

Agreed! I was just curious if there was a reason it couldn't be done.


export const resetFn = wgslFn( /* wgsl */ `

fn reset(
@gkjohnson (Owner)

nit: Can we rename this function and comment it so it's clearer what it's resetting? What is this for?

@TheBlek (Author) commented Nov 18, 2025

This shader resets the result and sample count buffers, so the accumulated color and the sample count both become 0. I'll adjust the name to better reflect the meaning.
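
Roughly this, with assumed buffer names (a sketch, not the actual code from the PR):

```js
// Sketch of the reset pass: zero the accumulated color and the per-pixel
// sample count so accumulation restarts from scratch. Names are assumed.
export const resetAccumulationSketch = wgslFn( /* wgsl */ `
fn resetAccumulation(
	result: ptr<storage, array<vec4f>, read_write>,
	sampleCounts: ptr<storage, array<u32>, read_write>,
	index: u32,
) -> void {
	result[ index ] = vec4f( 0.0 );
	sampleCounts[ index ] = 0u;
}
` );
```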

export const lambertBsdfFunc = wgslFn( /* wgsl */`
fn bsdfEval(rngState: ptr<function, PcgState>, normal: vec3f, view: vec3f) -> ScatterRecord {

const PI: f32 = 3.141592653589793;
@gkjohnson (Owner)

We should probably move this to a constants node.

1,
];
// 0. Clean queues
_renderer.compute( this.cleanQueuesKernel, 1 );
@gkjohnson (Owner)

I'm wondering if we want to / need to clear these queues? The buffer size limits may make this more complicated, but I was originally thinking of these queues as ongoing "work queues" that would have work constantly pushed onto them over multiple frames. Rays from still-unterminated paths from sample 1 could still be in the queue while rays from sample 2 (or 3 or 4) are being pushed onto the same queue, for example. Then there would always be work available on the queue for the compute nodes to pick up.

"Finishing" a render could mean a few things:

Small tiles can be run to start off new paths and fill up the queue as needed. This would also let users really tune in how much work they want to do per frame - rather than samples the use can say "only process X rays per frame" and that can be something that's auto-tunable.

And maybe this is a separate thought, but I'm also wondering if sorting rays can somehow help with performance and thread utilization.

@TheBlek (Author)

> rather than samples, the user can say "only process X rays per frame", and that can be something that's auto-tunable

I see. This sounds interesting and promising in terms of adapting to the hardware's capabilities. However, I think this should be done later, after the essential stuff is set up. Right now the ray generation kernel fully loads the ray queue without checking for overflow, so we have to clean the queues.
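
For reference, the missing overflow check would look roughly like this (a sketch; the queue names and capacity constant are assumptions):

```js
const boundedPushWGSL = /* wgsl */ `
let slot = atomicAdd( &queueSize, 1u );
if ( slot >= QUEUE_CAPACITY ) {
	// roll back the reservation and drop the ray; valid slots are never
	// handed out twice because successful reservations are never rolled back
	atomicSub( &queueSize, 1u );
	return;
}
rayQueue[ slot ] = ray;
`;
```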

> And maybe this is a separate thought, but I'm also wondering if sorting rays can somehow help with performance and thread utilization.

This is an interesting idea. I can see how sorting rays in a certain way could improve thread utilization. Certainly a topic to explore.

Perhaps we should create issues for those and return to them later?

@gkjohnson (Owner)

I agree we can take a look at these later. I added some small notes to #692.

>> rather than samples, the user can say "only process X rays per frame", and that can be something that's auto-tunable
>
> I see. This sounds interesting and promising in terms of adapting to the hardware's capabilities. However, I think this should be done later

Once the resolution of the render in the demo is right and we can get this merged, I'll take a look at this, if that's okay. I'm pretty interested in it.

Comment on lines +167 to +216
export const writeTraceRayDispatchSize = wgslFn( /* wgsl */ `
fn writeTraceRayDispatchSize(
	outputBuffer: ptr<storage, array<u32>, read_write>,
	queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
	workgroupSize: u32,
) -> void {
	// reset the hit (1) and escaped-ray (2) queues before tracing refills them
	atomicStore(&queueSizes[1], 0);
	atomicStore(&queueSizes[2], 0);

	// dispatch one workgroup per workgroupSize rays in the ray queue (0)
	let size = atomicLoad(&queueSizes[0]);
	outputBuffer[0] = u32( ceil( f32(size) / f32( workgroupSize ) ) );
	outputBuffer[1] = 1;
	outputBuffer[2] = 1;
}
` );

export const writeEscapedRayDispatchSize = wgslFn( /* wgsl */ `
fn writeEscapedRayDispatchSize(
	outputBuffer: ptr<storage, array<u32>, read_write>,
	queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
	workgroupSize: u32,
) -> void {
	// dispatch one workgroup per workgroupSize entries in the escaped-ray queue (2)
	let size = atomicLoad(&queueSizes[2]);
	outputBuffer[0] = u32( ceil( f32(size) / f32( workgroupSize ) ) );
	outputBuffer[1] = 1;
	outputBuffer[2] = 1;
}
` );

export const writeBsdfDispatchSize = wgslFn( /* wgsl */ `
fn writeBsdfDispatchSize(
	queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
	outputBuffer: ptr<storage, array<u32>, read_write>,
	workgroupSize: u32
) -> void {
	// reset the ray queue (0); it has been fully consumed by the trace pass
	atomicStore(&queueSizes[0], 0);

	// dispatch one workgroup per workgroupSize entries in the hit queue (1)
	let count = atomicLoad(&queueSizes[1]);
	outputBuffer[0] = u32( ceil( f32(count) / f32( workgroupSize ) ) );
	outputBuffer[1] = 1;
	outputBuffer[2] = 1;
}
` );

@gkjohnson (Owner)

It looks like these are just copying data into indirect buffers so we can issue indirect compute commands, is that right? Is it the case that we can't use the atomic storage buffers as indirect compute arguments?

@TheBlek (Author)

> It looks like these are just copying data into indirect buffers so we can issue indirect compute commands, is that right?

Yeah, we are writing the dispatch size into the indirect buffer.

> Is it the case that we can't use the atomic storage buffers as indirect compute arguments?

Well, we could possibly reuse the atomics storage buffer as one of the indirect buffers, but we can't get rid of them all. For the escaped-ray and BSDF sampling passes we need two dispatch buffers, and if we used the queue sizes buffer for one of them, the values relevant to the other would be lost.

Theoretically we could create a single dispatch buffer for everything and dispatch with offsets, so that elements [0, 3) of the array hold the dispatch size for the escaped-ray kernel and elements [3, 6) the size for the BSDF sample kernel. But I don't think this is supported in three.js right now.
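
For reference, the raw WebGPU API does support this through the byte offset argument of dispatchWorkgroupsIndirect (a sketch; the pipeline names are assumptions), so it's only the three.js wrapper that would need to expose it:

```js
// One indirect buffer holding both dispatch arguments; each argument is
// three u32s (12 bytes) and is selected by byte offset.
pass.setPipeline( escapedRayPipeline );
pass.dispatchWorkgroupsIndirect( dispatchBuffer, 0 );  // u32 elements [0, 3)
pass.setPipeline( bsdfSamplePipeline );
pass.dispatchWorkgroupsIndirect( dispatchBuffer, 12 ); // u32 elements [3, 6)
```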

Comment on lines +54 to +58
// 2.1 Dispatch shaders
// When processing escaped rays, calculate new contribution and add that to the result image
_renderer.compute( this.escapedRayKernel, this.escapedRayDispatchBuffer );
// When processing hit results, sample a new ray according to material's properties
_renderer.compute( this.bsdfEvalKernel, this.bsdfDispatchBuffer );
@gkjohnson (Owner)

I'm wondering if it's possible to merge this "escaped ray" kernel into one of the other ones?

@TheBlek (Author)

It is possible. However, the separation is deliberate and should help with thread utilization. The main reason wavefront path tracing can be faster than a regular megakernel is more uniform kernel execution duration, and hence better thread utilization.

This was my thinking when writing this.

However, I can see now that the original article had a unified "shade" stage for escaped and intersected rays. We could experiment with this in the future, once the performance baseline is established.
