
feat: add configurable CPU offloading for quantized models#294

Closed
olekssy wants to merge 1 commit into p-e-w:master from olekssy:feat/cpu-offload

Conversation


olekssy commented Apr 11, 2026

Summary

Adds a cpu_offload setting to allow enabling llm_int8_enable_fp32_cpu_offload in BitsAndBytesConfig. This prevents crashes when a quantized model exceeds available VRAM by allowing offloading to CPU/RAM.

Test plan

  • Verify that --cpu-offload CLI flag is recognized.
  • Verify that cpu_offload = true in config.toml is recognized.
  • Load a model that exceeds VRAM and confirm it loads successfully with the flag enabled.
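
The CLI check in the test plan can be sketched as follows. Only the `--cpu-offload` flag name and its `False` default come from this PR; the parser setup itself is illustrative, not the project's actual CLI code:

```python
import argparse

# Hypothetical CLI sketch: only the --cpu-offload flag name is taken from
# the test plan; the surrounding parser is an assumed minimal harness.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--cpu-offload",
    action="store_true",
    default=False,
    help="Enable FP32 CPU offload for quantized models.",
)

# argparse maps --cpu-offload to the attribute cpu_offload.
args = parser.parse_args(["--cpu-offload"])
print(args.cpu_offload)  # → True
```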

🤖 Generated with Claude Code

Adds the `cpu_offload` setting to allow enabling `llm_int8_enable_fp32_cpu_offload`
in BitsAndBytesConfig, which helps when the model exceeds available VRAM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot left a comment


Code Review

This pull request introduces a cpu_offload setting to enable FP32 CPU offloading for quantized models. However, the implementation uses a parameter specific to 8-bit quantization while the current configuration uses 4-bit, making the change ineffective. Furthermore, the PR includes unrelated refactoring of LoRA target module identification and unnecessary whitespace changes, which violate the repository's style guide regarding single-purpose PRs. The new configuration setting is also missing from the default configuration file.

```diff
 bnb_4bit_compute_dtype=compute_dtype,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_use_double_quant=True,
+llm_int8_enable_fp32_cpu_offload=self.settings.cpu_offload,
```


Severity: high

The llm_int8_enable_fp32_cpu_offload parameter is specific to 8-bit quantization (load_in_8bit=True). It has no effect when using 4-bit quantization (load_in_4bit=True), which is the only method currently supported by the QuantizationMethod enum. Consequently, this change does not actually enable CPU offloading for the models used in this application. For 4-bit models, offloading is typically handled via device_map and max_memory configurations.
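
For reference, steering 4-bit offloading via `device_map`/`max_memory` might look like the sketch below. The memory budgets are made-up values, and the actual `from_pretrained` call is left as a comment so the snippet stays self-contained:

```python
# Hypothetical sketch: for 4-bit models, offloading is controlled through
# device_map / max_memory rather than llm_int8_enable_fp32_cpu_offload.
# The budgets below are illustrative, not measured requirements.
max_memory = {
    0: "10GiB",      # cap usage on GPU 0; overflow layers spill to CPU
    "cpu": "30GiB",  # allow spillover into system RAM
}

load_kwargs = {
    "device_map": "auto",   # let accelerate place layers across devices
    "max_memory": max_memory,
    "load_in_4bit": True,   # normally supplied via BitsAndBytesConfig
}

# In real code these kwargs would be passed to
# AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs).
print(sorted(load_kwargs))
```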

Comment on lines +129 to +132
```diff
+cpu_offload: bool = Field(
+    default=False,
+    description="Whether to enable FP32 CPU offload for quantized models (suggested by transformers for VRAM issues).",
+)
```


Severity: medium

The new cpu_offload setting is missing from config.default.toml. According to the repository style guide, all new settings added to config.py must also be added to config.default.toml with their default value and description, maintaining the same relative order.

References
  1. When new settings are added in config.py, they should also be added to config.default.toml, set to their default value and with their description as a comment. The order of settings in config.default.toml should match that in config.py. (link)
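
Under that guideline, the fix would presumably be a matching entry in config.default.toml along these lines (placement and exact comment wording are assumptions, mirroring the Field description from the PR):

```toml
# Whether to enable FP32 CPU offload for quantized models (suggested by
# transformers for VRAM issues).
cpu_offload = false
```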

Comment on lines +172 to +184
```diff
+module_id_to_full_name = {
+    id(module): module_name
+    for module_name, module in self.model.named_modules()
+}
+
+for layer_index in range(len(self.get_layers())):
+    for modules in self.get_layer_modules(layer_index).values():
+        for module in modules:
-            if id(module) in module_id_to_leaf_name:
-                target_modules_set.add(module_id_to_leaf_name[id(module)])
+            full_name = module_id_to_full_name.get(id(module))
+            if full_name is not None:
+                target_modules_set.add(full_name)
+
-target_modules = list(target_modules_set)
+target_modules = sorted(target_modules_set)
```


Severity: medium

This refactoring of the LoRA target module identification (switching from leaf names to full module paths) is semantically independent of the CPU offloading feature. The repository style guide requires that pull requests implement only one change and avoid unrelated modifications to existing code. This should be moved to a separate PR.

References
  1. Pull requests should implement one change, and one change only. PRs must not change existing code unless the changes are directly related to the PR. (link)
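
The leaf-name versus full-path distinction the reviewer refers to can be illustrated with a toy stand-in for PyTorch's `named_modules` (the `Module` class, module names, and model shape below are all hypothetical, chosen only to mimic a transformer-style hierarchy):

```python
# Toy stand-in for torch.nn.Module.named_modules, illustrating leaf module
# names vs full module paths. Everything here is illustrative, not the
# project's actual code.
class Module:
    def __init__(self, **children):
        self._children = children

    def named_modules(self, prefix=""):
        yield prefix, self
        for name, child in self._children.items():
            full = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(full)

q_proj = Module()
attn = Module(q_proj=q_proj)
model = Module(layers=Module(**{"0": Module(self_attn=attn)}))

# Full paths uniquely identify a module; leaf names collapse modules that
# share the same final component (e.g. q_proj in every layer).
full_names = {id(m): name for name, m in model.named_modules()}
leaf_names = {id(m): name.rsplit(".", 1)[-1] for name, m in model.named_modules()}

print(leaf_names[id(q_proj)])  # → q_proj
print(full_names[id(q_proj)])  # → layers.0.self_attn.q_proj
```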

```diff
         skip_special_tokens=True,
     ),
 )
\ No newline at end of file
```


Severity: medium

This change in whitespace/newline at the end of the file is unrelated to the PR's objective and violates the repository style guide regarding unnecessary changes to existing code.

References
  1. PRs must not change existing code unless the changes are directly related to the PR. This includes changes to formatting and comments. (link)

olekssy closed this Apr 11, 2026
Author

olekssy commented Apr 11, 2026

PR closed due to incompatibility of the change with the quantization setup.
