We're encountering an out-of-memory (OOM) error on a machine with 2 TB of RAM when attempting to load Qwen3-235B-A22B.
The issue seems to be that torchtune loads the full, unsharded checkpoint with `self._checkpoint_client.load_base_checkpoint()`. For a bf16 model of this size the raw weights are roughly 470 GB, and since each rank appears to load its own full copy, the aggregate (~3.7 TB with eight ranks on one node) exceeds the available host memory. Loading the checkpoint shard by shard would likely resolve this; a rough sketch is below.
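As an illustration of what I mean (this is not torchtune's actual API; `iter_checkpoint_tensors` and the Hugging Face-style checkpoint layout are my assumptions), a loader could stream tensors one at a time from the sharded safetensors files, so the caller can shard or offload each tensor and free it before the next one is read:

```python
import json
import os
from typing import Iterator

import torch
from safetensors import safe_open


def iter_checkpoint_tensors(
    checkpoint_dir: str,
) -> Iterator[tuple[str, torch.Tensor]]:
    """Yield (name, tensor) pairs lazily from a Hugging Face-style
    sharded safetensors checkpoint. Peak host memory stays near a
    single tensor instead of the full ~470 GB state dict, provided
    the caller drops each tensor before requesting the next."""
    # HF convention: the index file maps each parameter to its shard.
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]

    # Visit each shard file once; safe_open does not load the whole
    # file, and get_tensor reads only that tensor's bytes from disk.
    for shard in sorted(set(weight_map.values())):
        with safe_open(
            os.path.join(checkpoint_dir, shard), framework="pt", device="cpu"
        ) as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```

Each rank could then copy its slice of every tensor into the FSDP-sharded model as the tensors stream past, rather than materializing the entire unsharded state dict up front.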
Has anyone else encountered this?
Originally posted by @leng-yue in #2867 (comment)