@uael
Contributor

@uael uael commented Dec 11, 2025

Description
Allow setting retain_command_buffer_references back to false by deferring buffer and texture destruction while they are still used by a command buffer. Replaces #8694.
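
As a rough illustration of the idea (not the code in this PR), deferring destruction with an in-flight counter could look like the sketch below. TrackedBuffer and its fields and methods are hypothetical names chosen for this example; only the in_flight_count name is taken from the discussion, and the real wgpu types are more involved.

```rust
/// Hypothetical stand-in for a buffer whose destruction can be deferred.
/// This is a single-threaded sketch, not wgpu's actual implementation.
#[derive(Debug, Default)]
pub struct TrackedBuffer {
    /// Number of command buffers that currently reference this buffer.
    in_flight_count: usize,
    /// Set once the application has called `destroy`.
    destroy_requested: bool,
    /// Set once the (simulated) GPU memory has actually been released.
    freed: bool,
}

impl TrackedBuffer {
    /// Called when a command buffer starts referencing this buffer.
    pub fn begin_use(&mut self) {
        self.in_flight_count += 1;
    }

    /// Called when a command buffer that referenced this buffer is done.
    pub fn end_use(&mut self) {
        self.in_flight_count = self.in_flight_count.saturating_sub(1);
        self.free_if_idle();
    }

    /// Request destruction; the memory is released immediately only if
    /// no command buffer still references the buffer.
    pub fn destroy(&mut self) {
        self.destroy_requested = true;
        self.free_if_idle();
    }

    pub fn is_freed(&self) -> bool {
        self.freed
    }

    fn free_if_idle(&mut self) {
        if self.destroy_requested && self.in_flight_count == 0 {
            self.freed = true;
        }
    }
}
```

In this sketch, calling destroy while in_flight_count is non-zero only records the request, and the last end_use performs the release.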

Testing
CTS and tests

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests. If applicable, add:
    • --target wasm32-unknown-unknown
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@uael
Contributor Author

uael commented Dec 11, 2025

CTS still hangs on Mac; I need to investigate this.

@andyleiserson
Contributor

The known cause of CTS hangs on Mac is #3084, although that doesn't totally make sense here, because the test list hasn't changed.

@uael
Contributor Author

uael commented Dec 12, 2025

Is this approach the right path forward? I'll add my measurements of memory usage with and without the auto-retain flag later today.

@uael uael force-pushed the uael/command-buffer-in-flight branch from f44dcdb to fe9eb2d on December 12, 2025 06:51
@uael
Contributor Author

uael commented Dec 12, 2025

@andyleiserson rebased to get your CTS filter changes. It's now hanging with 'webgpu:api,validation,encoding,cmds,copyTextureToTexture:texture_format_compatibility:*'. I just tested locally and they all pass (2327 / 2843; the others are skipped). Any idea where I should look first?

@andyleiserson
Contributor

I'm not sure about the approach. There is already a mechanism to defer destruction for buffers/textures that are in use -- see the code at the end of Buffer::destroy and Texture::destroy. There is a tension between safely keeping resources alive when needed, and enabling applications to destroy them immediately when desired to recover memory. The case where a resource is referenced in a command buffer, then destroyed before that command buffer is submitted, is intended to be supported (there is a check on submit that resources have not been destroyed).
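
To make the "destroyed before submit" case concrete, here is a minimal sketch, assuming simplified stand-in types rather than wgpu's real ones. Buffer, CommandBuffer, SubmitError and submit are hypothetical names for this example; the only point is that destroy takes effect immediately and the check happens at submission time.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Hypothetical stand-in for a GPU buffer; not wgpu's real type.
pub struct Buffer {
    pub label: String,
    destroyed: AtomicBool,
}

impl Buffer {
    pub fn new(label: &str) -> Arc<Self> {
        Arc::new(Self {
            label: label.to_owned(),
            destroyed: AtomicBool::new(false),
        })
    }

    /// Eagerly mark the buffer as destroyed so its memory can be recovered,
    /// even if a not-yet-submitted command buffer still references it.
    pub fn destroy(&self) {
        self.destroyed.store(true, Ordering::Release);
    }

    pub fn is_destroyed(&self) -> bool {
        self.destroyed.load(Ordering::Acquire)
    }
}

/// Hypothetical command buffer that records which buffers it uses.
pub struct CommandBuffer {
    used_buffers: Vec<Arc<Buffer>>,
}

impl CommandBuffer {
    pub fn new() -> Self {
        Self { used_buffers: Vec::new() }
    }

    pub fn use_buffer(&mut self, buffer: &Arc<Buffer>) {
        self.used_buffers.push(Arc::clone(buffer));
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError {
    /// Carries the label of the destroyed buffer.
    DestroyedBuffer(String),
}

/// Reject the submission if any referenced buffer was destroyed
/// after it was recorded into the command buffer.
pub fn submit(cmd: &CommandBuffer) -> Result<(), SubmitError> {
    for buffer in &cmd.used_buffers {
        if buffer.is_destroyed() {
            return Err(SubmitError::DestroyedBuffer(buffer.label.clone()));
        }
    }
    Ok(())
}
```

With this shape, recording a buffer and then calling destroy on it is allowed; the error surfaces only when the command buffer is submitted.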

It may be useful to look at #8129 and the "full set of changes" referenced in the description, in particular 42e4a04. This was a (never merged) attempt to address the resource lifetime problem a different way closer to what you are doing, where the resources would be kept alive via Arc references in the tracker and recovered only once all the references went away. It changed destroy to replace the Arc reference held in the hub with a tombstone, allowing the resource to be destroyed at that point if there weren't other references in trackers keeping it alive.
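
A minimal sketch of that tombstone idea, assuming a much simplified hub: Hub, Slot and Resource are hypothetical names for this example and do not reflect the code in #8129 or 42e4a04. The freed counter exists only so the example can observe when the resource is dropped.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Hypothetical GPU resource; `Drop` stands in for releasing its memory.
pub struct Resource {
    freed: Arc<AtomicUsize>,
}

impl Drop for Resource {
    fn drop(&mut self) {
        // Record that the resource has really been released.
        self.freed.fetch_add(1, Ordering::SeqCst);
    }
}

/// A hub slot either holds a live resource or a tombstone left by `destroy`.
pub enum Slot {
    Live(Arc<Resource>),
    Tombstone,
}

/// Hypothetical, simplified hub that owns one `Arc` per live resource.
pub struct Hub {
    slots: Vec<Slot>,
}

impl Hub {
    pub fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Register a resource and return its id (an index into `slots`).
    pub fn create(&mut self, freed: Arc<AtomicUsize>) -> usize {
        self.slots.push(Slot::Live(Arc::new(Resource { freed })));
        self.slots.len() - 1
    }

    /// What a tracker would call to keep the resource alive while it is used.
    pub fn get(&self, id: usize) -> Option<Arc<Resource>> {
        match self.slots.get(id)? {
            Slot::Live(resource) => Some(Arc::clone(resource)),
            Slot::Tombstone => None,
        }
    }

    /// Replace the hub's reference with a tombstone. The resource is dropped
    /// right away only if no tracker still holds an `Arc` to it.
    pub fn destroy(&mut self, id: usize) {
        if let Some(slot) = self.slots.get_mut(id) {
            *slot = Slot::Tombstone;
        }
    }
}
```

Here a tracker that obtained an Arc through get before destroy keeps the resource alive; the drop happens when that last reference goes away.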

I don't think it should be necessary to have both the in_flight_count mechanism and the schedule_resource_destruction mechanism for managing lifetime.

I would still be interested to see a test case. We have made changes over the past few months (in particular, encoding on finish) that should have made it harder to destroy resources and then still end up trying to use a command buffer that references them. The exception I know of is #7816. I believe that issue is still outstanding (other than by setting the retain references flag), and I'm not aware of a strategy for fixing it besides keeping all resources alive whenever they are referenced in a command buffer. But I think that might be a Metal bug, and I don't know if we want to remove the ability to eagerly destroy resources entirely if we can get it fixed in Metal.

@uael
Contributor Author

uael commented Dec 16, 2025

Thank you for the detailed explanation. I clearly didn't have enough context when approaching this, but what I'm sure about is that #7816 no longer happens with this proposed approach (or with the previous, closed one), even with unretained references (without #7842).

> I would still be interested to see a test case.

Unfortunately all I have is memory measurements from our iOS app, and #7842 is definitely the culprit: memory accumulates indefinitely when this specific commit is cherry-picked. But that's on v25.

> I'm not sure about the approach.

In the end I'm not even attached to this exact approach or behavior; I was just trying to get #7816 fixed without the need for #7842, hoping that it would fix the following as well:

-[MTLDebugDevice notifyExternalReferencesNonZeroOnDealloc:]:3459: failed assertion `The following Metal object is being destroyed while still required to be alive by the command buffer 0x1220c7a00 (label: (wgpu internal) Signal):
<MTLToolsObject: 0x600002b02450> -> <MTLSimBuffer: 0x60000382c300>
    label = Render Pass Vertex Buffer 
    length = 36 
    cpuCacheMode = MTLCPUCacheModeDefaultCache 
    storageMode = MTLStorageModePrivate 
    hazardTrackingMode = MTLHazardTrackingModeTracked 
    resourceOptions = MTLResourceCPUCacheModeDefaultCache MTLResourceStorageModePrivate MTLResourceHazardTrackingModeTracked  
    purgeableState = MTLPurgeableStateNonVolatile'
CoreSimulator 1048 - Device: iPad mini (A17 Pro) (395B57A4-D87A-4845-90CB-168FA4CE7140) - Runtime: iOS 26.0 (23A339) - DeviceType: iPad mini (A17 Pro)
Can't show file for stack frame : <DBGLLDBStackFrame: 0x827be4000> - stackNumber:12 - name:core::ptr::drop_in_place$LT$wgpu_hal..metal..Buffer$GT$::h4d4355beee563fc7 [inlined]. The file path does not exist on the file system: /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/ptr/mod.rs

Let me re-run my measurements against trunk instead of v25, with and without #7842.

> But I think that might be a Metal bug, and I don't know if we want to remove the ability to eagerly destroy resources entirely if we can get it fixed in Metal.

This appears to be true, but the validation error above makes me think that something might still be wrong in wgpu's tracking with respect to the behavior Metal expects. But again, that was on v25. Do you recall any recent change that could have fixed the above?

@uael
Contributor Author

uael commented Dec 17, 2025

Closing as I'm unable to reproduce on trunk; it sounds like it has already been fixed. I should have started from there directly. Sorry for the lost time.

@uael uael closed this Dec 17, 2025
@andyleiserson
Contributor

No worries. I do think that turning the retained references flag back off is desirable -- I just haven't had time to dig into it.

Re: the MTLDebugDevice error, a lot has changed since v25 aimed at resolving this kind of issue, so I wouldn't be surprised if it has been resolved, but I can't say anything for sure.
