Add robustness to GPU shaders #537
Changes from 1 commit
@@ -53,6 +53,7 @@ var<workgroup> sh_bitmaps: array<array<atomic<u32>, N_TILE>, N_SLICE>;
 // store count values packed two u16's to a u32
 var<workgroup> sh_count: array<array<u32, N_TILE>, N_SUBSLICE>;
 var<workgroup> sh_chunk_offset: array<u32, N_TILE>;
+var<workgroup> sh_atomic_failed: u32;

 @compute @workgroup_size(256)
 fn main(

@@ -63,7 +64,18 @@ fn main(
     for (var i = 0u; i < N_SLICE; i += 1u) {
         atomicStore(&sh_bitmaps[i][local_id.x], 0u);
     }
     workgroupBarrier();
+    if local_id.x == 0u {
+        let failed = bump.lines > config.lines_size;
+        sh_atomic_failed = u32(failed);
+    }
+    // also functions as barrier to protect zeroing of bitmaps
+    let failed = workgroupUniformLoad(&sh_atomic_failed);
+    if failed != 0u {
+        if global_id.x == 0u {
+            bump.failed |= STAGE_FLATTEN;
+        }
+        return;
+    }

     // Read inputs and determine coverage of bins
     let element_ix = global_id.x;

Review thread on `atomicStore(&sh_bitmaps[i][local_id.x], 0u);`:

Reviewer: I know this isn't related to this PR, but as far as I can tell, this is already guaranteed to be zeroed. If this is to work around a driver/naga bug, we should have a comment here.

Author: Ah, I didn't realize that was a strong guarantee. In WebGPU world, it's probably worth skipping this explicit zeroing, but in native world it might be worth compiling with zeroing by infrastructure disabled, in which case we would need this.

Reviewer: Good point - for us it's not impactful, but e.g. before #363 this would have mattered for the MSL conversion.
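The failure broadcast in the hunk above (one invocation publishes a flag into workgroup shared memory, then every invocation reads it back uniformly via `workgroupUniformLoad` and early-outs together) can be sketched as a CPU analogue. This is a hedged illustration, not vello code: the thread count, the `run_workgroup` helper, and the use of an atomic plus a barrier to stand in for shared memory and the load's implied synchronization are all inventions of the sketch.

```rust
use std::sync::{
    atomic::{AtomicU32, Ordering},
    Arc, Barrier,
};
use std::thread;

// Stands in for the 256-wide workgroup; small so the sketch runs quickly.
const N_THREADS: usize = 4;

// CPU analogue of the shader's early-out: "thread 0" publishes the failure
// flag (here an atomic plays the role of sh_atomic_failed), the barrier
// models the synchronization workgroupUniformLoad implies, and then every
// thread observes the same value and returns before doing any work.
fn run_workgroup(failed_input: bool) -> u32 {
    let shared_flag = Arc::new(AtomicU32::new(0));
    let barrier = Arc::new(Barrier::new(N_THREADS));
    let work_done = Arc::new(AtomicU32::new(0));

    let handles: Vec<_> = (0..N_THREADS)
        .map(|local_id| {
            let flag = Arc::clone(&shared_flag);
            let barrier = Arc::clone(&barrier);
            let work = Arc::clone(&work_done);
            thread::spawn(move || {
                if local_id == 0 {
                    flag.store(u32::from(failed_input), Ordering::Relaxed);
                }
                barrier.wait(); // ~ the barrier implied by workgroupUniformLoad
                if flag.load(Ordering::Relaxed) != 0 {
                    return; // uniform early-out: every thread takes this path
                }
                work.fetch_add(1, Ordering::Relaxed); // the "real" work
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    work_done.load(Ordering::Relaxed)
}
```

The point of the uniform load is that the branch is workgroup-uniform, so either all invocations return or none do, which keeps later barriers valid.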
@@ -73,6 +73,9 @@ fn alloc_cmd(size: u32) {
     let ptcl_dyn_start = config.width_in_tiles * config.height_in_tiles * PTCL_INITIAL_ALLOC;
     var new_cmd = ptcl_dyn_start + atomicAdd(&bump.ptcl, PTCL_INCREMENT);
     if new_cmd + PTCL_INCREMENT > config.ptcl_size {
+        // This sets us up for technical UB, as lots of threads will be writing
+        // to the same locations. But I think it's fine, and predicating the
+        // writes would probably slow things down.
         new_cmd = 0u;
         atomicOr(&bump.failed, STAGE_COARSE);
     }

Review thread on lines +76 to +78:

Reviewer: Would it be reasonable to have

Author: Intriguing idea! However, that won't quite avoid UB, as cmd_offset will edge into the allocation following this one. Setting it to

Reviewer: Ah, because the allocations are variably sized? I can't say I'm that happy about adding UB, but I do agree that it's unlikely to cause a problem in practice. I wonder how bad the cost of writing to the same location is in terms of memory bandwidth/cache coherency?
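The overflow path above can be sketched on the CPU: bump-allocate a chunk, and when the allocation would run past the end of the buffer, fall back to offset 0 and record the failed stage. This is a hedged sketch, not the shader itself; the constant values, the `STAGE_COARSE` bit, and the parameter shapes of `alloc_cmd` here are inventions of the sketch.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative values only; the real ones live in the shader and config.
const PTCL_INCREMENT: u32 = 256;
const STAGE_COARSE: u32 = 0x4; // hypothetical bit assignment for the sketch

// CPU sketch of alloc_cmd's failure path: bump-allocate PTCL_INCREMENT
// slots, and if the chunk would overrun ptcl_size, return the shared
// fallback offset 0 (all failing threads write there, racy but in-bounds)
// and OR the failure stage into the shared flags.
fn alloc_cmd(bump_ptcl: &AtomicU32, failed: &AtomicU32, dyn_start: u32, ptcl_size: u32) -> u32 {
    let new_cmd = dyn_start + bump_ptcl.fetch_add(PTCL_INCREMENT, Ordering::Relaxed);
    if new_cmd + PTCL_INCREMENT > ptcl_size {
        failed.fetch_or(STAGE_COARSE, Ordering::Relaxed);
        return 0; // shared fallback slot; keeps every write in-bounds
    }
    new_cmd
}
```

The trade-off the review thread debates is visible here: the fallback keeps writes bounded at the cost of many threads scribbling over the same region.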
@@ -152,11 +155,16 @@ fn main(
+    // We need to check only prior stages, as if this stage has failed in another workgroup,
+    // we still want to know this workgroup's memory requirement.
     if local_id.x == 0u {
+        let failed = (atomicLoad(&bump.failed) & (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_FLATTEN)) != 0u
+            || atomicLoad(&bump.seg_counts) > config.seg_counts_size;
         // Reuse sh_part_count to hold failed flag, shmem is tight
-        sh_part_count[0] = atomicLoad(&bump.failed);
+        sh_part_count[0] = u32(failed);
     }
     let failed = workgroupUniformLoad(&sh_part_count[0]);
-    if (failed & (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_PATH_COARSE)) != 0u {
+    if failed != 0u {
+        if wg_id.x == 0u && local_id.x == 0u {
+            atomicOr(&bump.failed, STAGE_COARSE);
+        }
         return;
     }
     let width_in_bins = (config.width_in_tiles + N_TILE_X - 1u) / N_TILE_X;

Review thread on `atomicOr(&bump.failed, STAGE_COARSE);`:

Reviewer: What does this represent? I would think this should be

Author: I'll make the change, now that I'm thinking these flags will be interpreted by the client, not just to early-out downstream.
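The prior-stage check in this hunk treats `bump.failed` as a bitmask, one bit per pipeline stage, and masks out the current stage so that a coarse failure reported by another workgroup does not hide this workgroup's own memory requirement. A minimal sketch of that masking, with hypothetical bit assignments (the shader defines its own values):

```rust
// Hypothetical stage-flag bit assignments for the sketch.
const STAGE_BINNING: u32 = 1 << 0;
const STAGE_TILE_ALLOC: u32 = 1 << 1;
const STAGE_FLATTEN: u32 = 1 << 2;
const STAGE_COARSE: u32 = 1 << 3;

// Coarse early-outs only on *prior* stages' failures: its own bit is
// deliberately absent from the mask, mirroring the hunk above.
fn prior_stage_failed(failed: u32) -> bool {
    failed & (STAGE_BINNING | STAGE_TILE_ALLOC | STAGE_FLATTEN) != 0
}
```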
@@ -431,9 +439,11 @@
     }
     if bin_tile_x + tile_x < config.width_in_tiles && bin_tile_y + tile_y < config.height_in_tiles {
         ptcl[cmd_offset] = CMD_END;
+        var blend_ix = 0u;
         if max_blend_depth > BLEND_STACK_SPLIT {
             let scratch_size = max_blend_depth * TILE_WIDTH * TILE_HEIGHT;
-            ptcl[blend_offset] = atomicAdd(&bump.blend, scratch_size);
+            blend_ix = atomicAdd(&bump.blend, scratch_size);
         }
+        ptcl[blend_offset] = blend_ix;
     }
 }
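After this change, `ptcl[blend_offset]` is always written: it stays 0 for shallow blend stacks and only becomes a bump-allocated scratch index when the blend depth spills past the split point. A hedged CPU sketch of that logic follows; the constant values and the `blend_index` helper are inventions of the sketch, not vello's API.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative tile and split constants for the sketch.
const TILE_WIDTH: u32 = 16;
const TILE_HEIGHT: u32 = 16;
const BLEND_STACK_SPLIT: u32 = 4;

// Mirrors the hunk: default the blend index to 0 and only bump-allocate
// scratch (one slot per pixel per blend level) when the depth exceeds
// the split, so the slot is initialized on every path.
fn blend_index(bump_blend: &AtomicU32, max_blend_depth: u32) -> u32 {
    let mut blend_ix = 0;
    if max_blend_depth > BLEND_STACK_SPLIT {
        let scratch_size = max_blend_depth * TILE_WIDTH * TILE_HEIGHT;
        blend_ix = bump_blend.fetch_add(scratch_size, Ordering::Relaxed);
    }
    blend_ix
}
```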
@@ -16,8 +16,12 @@ let WG_SIZE = 256u;

 @compute @workgroup_size(1)
 fn main() {
-    let lines = atomicLoad(&bump.lines);
-    indirect.count_x = (lines + (WG_SIZE - 1u)) / WG_SIZE;
+    if atomicLoad(&bump.failed) != 0u {
+        indirect.count_x = 0u;
+    } else {
+        let lines = atomicLoad(&bump.lines);
+        indirect.count_x = (lines + (WG_SIZE - 1u)) / WG_SIZE;
+    }
     indirect.count_y = 1u;
     indirect.count_z = 1u;
 }

Review thread on `indirect.count_x = 0u;`:

Reviewer: I'm impressed that this works. Reading the specs suggests it's fine.

Author: Yes, this works and it's the only way I'm aware of that allows you to "abort" this type of indirect dispatch (there are more sophisticated ways with bindless, see for example: https://developer.apple.com/documentation/metal/indirect_command_encoding/encoding_indirect_command_buffers_on_the_gpu?language=objc). Interestingly, I couldn't find any explicit wording in the WebGPU, Metal, Vulkan, or D3D12 docs that this is the expected behavior, but "0" falls within the accepted range for all of them. See also this past discussion: gpuweb/gpuweb#1045
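The setup shader above computes the indirect dispatch size: on failure it zeroes `count_x`, which all of the APIs discussed in the thread accept as a valid (empty) dispatch, effectively aborting the downstream pass; otherwise it rounds the line count up to whole workgroups. A small sketch of that arithmetic (the `dispatch_counts` helper is hypothetical, introduced only for illustration):

```rust
const WG_SIZE: u32 = 256;

// Mirrors the setup shader: a nonzero failure flag yields a zero-sized
// (and therefore no-op) indirect dispatch; otherwise lines are rounded up
// to whole workgroups with a ceiling division.
fn dispatch_counts(failed: u32, lines: u32) -> (u32, u32, u32) {
    let count_x = if failed != 0 {
        0
    } else {
        (lines + (WG_SIZE - 1)) / WG_SIZE
    };
    (count_x, 1, 1)
}
```

Note that `lines == 0` also yields a zero-sized dispatch, so the failure path simply reuses behavior the downstream pass must already tolerate.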