Fix Z-Image VAE operations not reserving working memory for OOM prevention #8
Merged
lstein merged 7 commits into lstein-master on Jan 5, 2026
Conversation
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix out of memory errors with larger models → Fix Z-Image VAE operations not reserving working memory for OOM prevention on Jan 4, 2026
Owner
@copilot This error is occurring in the unit test:
…utoEncoder Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Author
Fixed in commit e13401b. The issue was that …
Owner
@copilot Still failures on the python tests:
…fields Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Fix VRAM cache calculation to properly account for `device_working_mem_gb`
Problem Analysis
When generating with larger models (like Q8 Z-Image Turbo), users get OOM errors during VAE decoding.
Root Cause:
The `ZImageLatentsToImageInvocation` and `ZImageImageToLatentsInvocation` do not request additional working memory for VAE operations, unlike the standard SD1.5/SDXL/SD3/CogView4 invocations. This means the model cache doesn't offload enough models from VRAM before VAE operations run, leaving no room for the operation's intermediate tensors.

Comparison:

- `LatentsToImageInvocation` (SD1.5/SDXL): calls `estimate_vae_working_memory_sd15_sdxl()` and passes `working_mem_bytes` to `model_on_device()`
- `SD3LatentsToImageInvocation`: calls `estimate_vae_working_memory_sd3()` and passes `working_mem_bytes`
- `CogView4LatentsToImageInvocation`: calls `estimate_vae_working_memory_cogview4()` and passes `working_mem_bytes`
- `ZImageLatentsToImageInvocation`: didn't estimate or request working memory (NOW FIXED ✅)
- `ZImageImageToLatentsInvocation`: didn't estimate or request working memory (NOW FIXED ✅)

Changes Made
- Updated `z_image_latents_to_image.py` and `z_image_image_to_latents.py` to estimate VAE working memory and pass it via `model_on_device(working_mem_bytes=...)`
- Updated the tests to use `model_construct()` to bypass Pydantic validation for mock objects

Technical Details
The fix adds working memory estimation to both Z-Image VAE invocations:
- Detects whether the VAE is a FLUX VAE (`FluxAutoEncoder`) or a Diffusers VAE (`AutoencoderKL`)
- Uses `estimate_vae_working_memory_flux()` for the FLUX VAE
- Uses `estimate_vae_working_memory_sd3()` for `AutoencoderKL`
- Passes the estimate to `model_on_device(working_mem_bytes=...)`

This ensures the model cache properly offloads models to make room for VAE operations before they run, preventing OOM errors.
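The dispatch-and-estimate pattern described above can be sketched roughly as follows. The helper names mirror those mentioned in the PR, but the byte arithmetic is an illustrative assumption, not InvokeAI's actual estimator:

```python
# Illustrative sketch of the working-memory pattern described above.
# The function names mirror the PR; the size arithmetic is a simplified
# assumption, not InvokeAI's actual formula.

def estimate_vae_working_memory_flux(latent_h: int, latent_w: int,
                                     bytes_per_element: int = 4) -> int:
    # The VAE upscales latents 8x spatially into a 3-channel image;
    # the 4x multiplier is an assumed margin for intermediate activations.
    out_elements = (latent_h * 8) * (latent_w * 8) * 3
    return out_elements * bytes_per_element * 4


def estimate_vae_working_memory_sd3(latent_h: int, latent_w: int,
                                    bytes_per_element: int = 4) -> int:
    out_elements = (latent_h * 8) * (latent_w * 8) * 3
    return out_elements * bytes_per_element * 3  # assumed smaller margin


def working_mem_for_vae(vae: object, latent_h: int, latent_w: int) -> int:
    # Dispatch on the VAE type, as the fix does for FluxAutoEncoder
    # (FLUX VAE) vs. AutoencoderKL (Diffusers).
    if type(vae).__name__ == "FluxAutoEncoder":
        return estimate_vae_working_memory_flux(latent_h, latent_w)
    return estimate_vae_working_memory_sd3(latent_h, latent_w)
```

The resulting byte count is what gets passed as `model_on_device(working_mem_bytes=...)`, so the cache can evict enough models before the decode or encode runs.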
Test Fixes
- A `config` attribute was being set on the `FluxAutoEncoder` mock, which doesn't have this attribute. The test now only sets `config` attributes for `AutoencoderKL` VAEs.
- The tests use `model_construct()` instead of the regular constructor to create invocation instances with mock fields, bypassing validation while still testing the core logic.

Files Modified
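The `model_construct()` technique mentioned in the test fixes is Pydantic v2's documented way to build a model instance without running field validation. A minimal sketch, where `MockVae` and `DecodeInvocation` are hypothetical stand-ins rather than InvokeAI's real classes:

```python
# Minimal sketch of using model_construct() to inject mock field values.
# MockVae and DecodeInvocation are hypothetical stand-ins, not InvokeAI classes.
from pydantic import BaseModel, ValidationError


class MockVae:
    """Stand-in for a unittest.mock-style object that fails validation."""


class DecodeInvocation(BaseModel):
    steps: int = 1  # illustrative field; a MockVae is not an int


# The regular constructor validates field types and rejects the mock:
try:
    DecodeInvocation(steps=MockVae())  # type: ignore[arg-type]
    raise AssertionError("validation unexpectedly passed")
except ValidationError:
    pass

# model_construct() skips validation, so the mock is stored as-is,
# letting a test exercise the invocation's logic with fake fields.
inv = DecodeInvocation.model_construct(steps=MockVae())
```

This is the same trade-off the PR's tests make: validation is bypassed deliberately, so the test remains responsible for supplying sensible field values.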
- `invokeai/app/invocations/z_image_latents_to_image.py`: added working memory estimation for decode
- `invokeai/app/invocations/z_image_image_to_latents.py`: added working memory estimation for encode
- `tests/app/invocations/test_z_image_working_memory.py`: added tests to verify working memory estimation

Expected Impact
Users will no longer need to manually set `max_cache_vram_gb` to work around OOM errors. The `device_working_mem_gb` setting (default 3 GB) will now work correctly for Z-Image models, as the VAE operations will request appropriate working memory and the model cache will offload models accordingly.

Original prompt
This section details the original issue to resolve.
<issue_title>[bug]: Out of Memory errors with larger models</issue_title>
<issue_description>### Is there an existing issue for this problem?
Install method
Invoke's Launcher
Operating system
Linux
GPU vendor
Nvidia (CUDA)
GPU model
RTX 4070
GPU VRAM
12GB
Version number
6.10.0rc2
Browser
No response
System Information
No response
What happened
When generating with the Q8 Z-Image Turbo model, I am getting out of memory errors during the VAE decoding phase. I can avoid the errors by setting `max_cache_vram_gb` to 4 GB, at which point I see VRAM memory use rise to ~4 GB. However, it doesn't seem intuitive to me that adjusting the VRAM cache should be the way to fix the error.

I also tried setting `device_working_mem_gb: 4` in my config file, but without the VRAM cache setting, I again get OOM.

Here is the log from a successful generation with the VRAM cache limited to 4 GB: