[WIP] Webgpu pathtracer prototype #705

Conversation
Amazing work! Thanks for putting the time in. In terms of getting an initial version merged I think these are some good goals - this should help get the general architecture defined and answer some of the open questions. Your other TODOs can be added in subsequent PRs, which will make things easier to review.
In terms of making this easier to understand - do you mind splitting some of the code out into separate files? The "PathTracerCore" file is pretty monolithic and difficult to follow at the moment. I think these changes will make it easier to give feedback.
And some other comments:
It would be helpful to know what data is needed at each step of the path tracer. But organizing things from the beginning so that tiled rendering is supported is, I think, our best bet. Once three.js supports requesting larger buffers users can rely on that, but we'll still need a graceful fallback either way.
Let's get to a point where we have a full resolution scene rendering so we can take a look at the pros / cons of the megakernel and wavefront approaches. The wavefront approach may also wind up having more of a noticeable improvement later on once we can terminate rays early and / or sort them, and therefore avoid running large blocks of work. Also, just a note that the "megakernel" path seems to not be working for me. I see this error:
Thanks for organizing things - it helps quite a bit. I've just taken a deeper look and I think I'm understanding how things roughly work now. I've added a few initial questions, but excuse me if I've asked anything obvious as I wrap my head around things 😅
```js
export const pcgInit = wgslFn( /* wgsl */`
	fn pcg_initialize(state: ptr<function, PcgState>, p: vec2u, frame: u32) -> void {
```
nit: is there a reason we can't store the "state" variable as a global variable? It seems simpler than passing a state argument in every time.
I initially didn't want to work out how to do that in the nodes system. I'll take a look at whether it's possible.
Agreed! I was just curious if there was a reason it couldn't be done.
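For reference, a minimal WGSL sketch of what the global-state variant could look like, assuming module-scope declarations can be shared between the kernels somehow - the seed hash and the constants below are the common PCG formulation, not this PR's code:

```wgsl
// Module-scope, private-address-space state: each invocation gets its own
// copy, so no pointer has to be threaded through every call.
var<private> pcg_state: u32;

fn pcg_initialize( p: vec2u, frame: u32 ) {
	// Fold the pixel coordinate and frame index into a per-thread seed.
	pcg_state = p.x + p.y * 9781u + frame * 26699u;
}

fn pcg_next() -> u32 {
	// Standard PCG step: LCG advance followed by an output permutation.
	pcg_state = pcg_state * 747796405u + 2891336453u;
	let word = ( ( pcg_state >> ( ( pcg_state >> 28u ) + 4u ) ) ^ pcg_state ) * 277803737u;
	return ( word >> 22u ) ^ word;
}
```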
```js
export const resetFn = wgslFn( /* wgsl */ `
	fn reset(
```
nit: Can we rename this function and comment it so it's clearer what it's resetting? What is this for?
This shader resets the result and sample count buffers, so any accumulated color and sample count stored become 0. Will adjust the name to better reflect the meaning.
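Something along these lines, perhaps (the names and buffer layout here are illustrative, not the final code):

```js
// Clears the per-pixel accumulation state so a fresh render can start:
// the accumulated color and the sample count both return to zero.
export const resetAccumulationFn = wgslFn( /* wgsl */ `
	fn resetAccumulation(
		resultBuffer: ptr<storage, array<vec4f>, read_write>,
		sampleCountBuffer: ptr<storage, array<u32>, read_write>,
		index: u32,
	) -> void {
		resultBuffer[ index ] = vec4f( 0.0 );
		sampleCountBuffer[ index ] = 0u;
	}
` );
```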
```js
export const lambertBsdfFunc = wgslFn( /* wgsl */`
	fn bsdfEval(rngState: ptr<function, PcgState>, normal: vec3f, view: vec3f) -> ScatterRecord {
		// ...
		const PI: f32 = 3.141592653589793;
```
We should probably move this to a constants node.
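For example, a single shared snippet that each kernel pulls in - the exact mechanism depends on how the constants node ends up looking, this is just a sketch:

```js
// Shared math constants, defined once instead of redeclared per kernel.
export const mathConstantsWGSL = /* wgsl */ `
	const PI: f32 = 3.141592653589793;
	const INV_PI: f32 = 0.3183098861837907;
`;
```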
```js
	1,
];

// 0. Clean queues
_renderer.compute( this.cleanQueuesKernel, 1 );
```
I'm wondering if we want to / need to clear these queues? The buffer size limits may make this more complicated, but I was originally thinking of these queues as ongoing "work queues" that would have work constantly pushed onto them over multiple frames. Rays from still-unterminated paths from sample 1 could still be in the queue while rays from sample 2 (or 3 or 4) are being pushed onto the same queue, for example. Then there can always be work available on the queue for the compute nodes to pick up.
"Finishing" a render could mean a few things:
- The maximum samples have been reached for every pixel and the last ray from the queue is complete.
- Some pixels can be terminated early if they are determined to already be "resolved" enough (see #198: "Add ability to report noise on each pass, allows halting on a noise threshold")
Small tiles can be run to start off new paths and fill up the queue as needed. This would also let users really tune how much work they want to do per frame - rather than samples, the user can say "only process X rays per frame", and that can be something that's auto-tunable.
And maybe a separate thought but I'm also wondering if sorting rays can somehow help with performance and thread utilization.
> rather than samples, the user can say "only process X rays per frame", and that can be something that's auto-tunable.
I see. This sounds interesting and promising in terms of adapting to the hardware's capabilities. However, I think this should be done later, after the essential stuff is set up. Right now the ray generation kernel fully loads the ray queue with rays without checking for overflows, so we have to clean the queues.
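For what it's worth, here's a hedged sketch of the kind of bounds-checked push that persistent queues would need - the `Ray` struct, the `capacity` parameter, and the function name are illustrative, not this PR's code:

```js
export const pushRayFn = wgslFn( /* wgsl */ `
	fn pushRay(
		queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
		rayQueue: ptr<storage, array<Ray>, read_write>,
		ray: Ray,
		capacity: u32,
	) -> bool {
		// Reserve a slot; back out if the queue is already full.
		let slot = atomicAdd( &queueSizes[ 0 ], 1u );
		if ( slot >= capacity ) {
			atomicSub( &queueSizes[ 0 ], 1u );
			return false;
		}
		rayQueue[ slot ] = ray;
		return true;
	}
` );
```

The dispatch-size kernels would then clamp the counter to the capacity when computing workgroup counts, since the counter can transiently overshoot.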
> And maybe a separate thought but I'm also wondering if sorting rays can somehow help with performance and thread utilization.
This is an interesting idea. I can see how sorting rays in a certain way could improve thread utilization. Certainly a topic to explore.
Perhaps we should create issues for those and return to them later on?
I agree we can take a look at these later. I added some small notes to #692.
> rather than samples, the user can say "only process X rays per frame", and that can be something that's auto-tunable.

> I see. This sounds interesting and promising in terms of adapting to the hardware's capabilities. However, I think this should be done later
Once the resolution of the render in the demo is right and we can get this merged, I'll take a look at this if that's okay - I'm pretty interested in it.
```js
export const writeTraceRayDispatchSize = wgslFn( /* wgsl */ `
	fn writeTraceRayDispatchSize(
		outputBuffer: ptr<storage, array<u32>, read_write>,
		queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
		workgroupSize: u32,
	) -> void {

		// Reset the bsdf and escaped-ray queue counters before the next pass fills them.
		atomicStore(&queueSizes[1], 0);
		atomicStore(&queueSizes[2], 0);

		// Convert the ray queue size into a workgroup count for the indirect dispatch.
		let size = atomicLoad(&queueSizes[0]);
		outputBuffer[0] = u32( ceil( f32(size) / f32( workgroupSize ) ) );
		outputBuffer[1] = 1;
		outputBuffer[2] = 1;

	}
` );

export const writeEscapedRayDispatchSize = wgslFn( /* wgsl */ `
	fn writeEscapedRayDispatchSize(
		outputBuffer: ptr<storage, array<u32>, read_write>,
		queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
		workgroupSize: u32,
	) -> void {

		let size = atomicLoad(&queueSizes[2]);
		outputBuffer[0] = u32( ceil( f32(size) / f32( workgroupSize ) ) );
		outputBuffer[1] = 1;
		outputBuffer[2] = 1;

	}
` );

export const writeBsdfDispatchSize = wgslFn( /* wgsl */ `
	fn writeBsdfDispatchSize(
		queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
		outputBuffer: ptr<storage, array<u32>, read_write>,
		workgroupSize: u32
	) -> void {

		// Reset the trace-ray queue counter before the next pass fills it.
		atomicStore(&queueSizes[0], 0);

		let count = atomicLoad(&queueSizes[1]);
		outputBuffer[0] = u32( ceil( f32(count) / f32( workgroupSize ) ) );
		outputBuffer[1] = 1;
		outputBuffer[2] = 1;

	}
` );
```
It looks like these are just copying data into indirect buffers so we can issue indirect compute commands, is that right? Is it the case that we can't use atomic storage buffers as indirect compute arguments?
> It looks like these are just copying data into indirect buffers so we can issue indirect compute commands, is that right?
Yeah, we are writing the dispatch size into the indirect buffer.
> Is it the case that we can't use atomic storage buffers as indirect compute arguments?
Well, we could possibly reuse the atomics storage buffer as one of the indirect buffers, but we can't get rid of them all. In the case of escaped rays and bsdf sampling we need two dispatch buffers, and if we use the queue sizes buffer for one of them, the relevant values for the other will be lost.
Theoretically we could create only one dispatch buffer for everything and dispatch with offsets, so values [0; 3) of the array would hold the dispatch size for the escaped ray kernel and [3; 6) for the bsdf sample kernel. But I don't think this is supported in three.js right now.
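For reference, a sketch of that single-buffer idea (hypothetical, since as noted the offset-based indirect dispatch isn't available):

```js
export const writeAllDispatchSizes = wgslFn( /* wgsl */ `
	fn writeAllDispatchSizes(
		dispatchBuffer: ptr<storage, array<u32>, read_write>,
		queueSizes: ptr<storage, array<atomic<u32>>, read_write>,
		workgroupSize: u32,
	) -> void {
		// [0; 3): escaped ray kernel, [3; 6): bsdf sample kernel.
		let escapedCount = atomicLoad( &queueSizes[ 2 ] );
		dispatchBuffer[ 0 ] = ( escapedCount + workgroupSize - 1u ) / workgroupSize;
		dispatchBuffer[ 1 ] = 1u;
		dispatchBuffer[ 2 ] = 1u;

		let bsdfCount = atomicLoad( &queueSizes[ 1 ] );
		dispatchBuffer[ 3 ] = ( bsdfCount + workgroupSize - 1u ) / workgroupSize;
		dispatchBuffer[ 4 ] = 1u;
		dispatchBuffer[ 5 ] = 1u;
	}
` );
```

The renderer would then issue the two indirect dispatches at byte offsets 0 and 12 into the same buffer.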
```js
// 2.1 Dispatch shaders
// When processing escaped rays, calculate new contribution and add that to the result image
_renderer.compute( this.escapedRayKernel, this.escapedRayDispatchBuffer );
// When processing hit results, sample a new ray according to material's properties
_renderer.compute( this.bsdfEvalKernel, this.bsdfDispatchBuffer );
```
I'm wondering if it's possible to merge this "escaped ray" kernel into one of the other ones?
It is possible. However, the separation is deliberate and should help with thread utilization. The main reason wavefront pathtracing would be faster than a regular megakernel is more uniform kernel execution duration and hence better thread utilization. This was my thinking when writing this.
However, I can see now that the original article had a unified "shade" stage for escaped and intersected rays. We could experiment with this in the future, once the performance baseline is established.
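If we do experiment with it later, the unified stage might roughly look like this - a sketch assuming a `HitResult` struct with a `didHit` flag, with the actual shading helpers left as comments:

```js
export const shadeFn = wgslFn( /* wgsl */ `
	fn shade(
		hitBuffer: ptr<storage, array<HitResult>, read_write>,
		index: u32,
	) -> void {
		let hit = hitBuffer[ index ];
		if ( hit.didHit == 0u ) {
			// Escaped ray: add the environment contribution to the result image.
		} else {
			// Intersected ray: evaluate the bsdf and enqueue the next bounce ray.
		}
	}
` );
```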
Hi @gkjohnson! Here's a simple lambertian pathtracer - both megakernel and wavefront, as discussed in #692. See the webgpu_primitives demo. There is still a lot of work to be done here. I will try to keep an up-to-date todo list here so you can follow the progress.
At what point do you propose measuring performance?
I think it is not so difficult to keep both the megakernel and wavefront pathtracers, as they share the essential code (for now at least). It could also be useful to have a point of reference when implementing additional features, and I'm not sure how relative performance will scale with features.
In WebGPU we can bind buffers of at most 128 MiB by default, and there is no way to request bigger limits in three.js. That leaves us with a max tile resolution of 1920x1080 (which I hardcoded in the example for now). I'm wondering whether requesting bigger WebGPU limits, or features for that matter, is a worthwhile feature for three.js. What do you think?
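For a rough sense of the arithmetic (assuming, for illustration, a 64-byte queue entry per pixel): 1920 × 1080 = 2,073,600 entries × 64 B ≈ 126.6 MiB, which just fits under the 134,217,728-byte (128 MiB) default binding limit, while the next common step up in resolution would not.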
This buffer limit also constrains the scene size, at least until a TLAS is deployed and tracing is done per-mesh instead of over the whole scene.
Also, by default we can bind a maximum of 8 storage buffers per shader stage, which is a bottleneck for the traceRay kernel. It needs a lot of buffers, so I had to move reading the material index into the next kernel (the bsdf eval phase). This could be fixed if three.js had support for array types in structs and for top-level storage-buffer-backed structs. At least I could not get those to work, so I had to put the queue sizes into a separate buffer and pass them separately. I also wonder if I should file an issue about this limitation in three.js - but maybe the support exists and I just didn't find it.
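For reference, this is the kind of binding that struct support would allow - valid WGSL (with some `Ray` struct assumed), but per the above not currently expressible through three.js:

```wgsl
// Queue counter and payload in one binding: a runtime-sized array is
// allowed as the last member of a storage-buffer struct.
struct RayQueue {
	size: atomic<u32>,
	rays: array<Ray>,
};

@group( 0 ) @binding( 0 ) var<storage, read_write> rayQueue: RayQueue;
```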
What's your policy on the three.js version for the WebGPU release here? Ideally we would use r181, as it includes support for indirect compute (which is essential for wavefront pathtracing to avoid reading values back from the GPU between calls). But it released like 5 days ago :) For now I copy-pasted the compute methods from WebGPURenderer and WebGPUBackend with small tweaks in order to compute indirectly - see the computeIndirect and computeBackendIndirect functions in PathTracerCore. Just saw that the library recently upgraded to r181 - perfect!
Would love any feedback!
For later: