
feat: add configurable CPU offloading for quantized models#294

Closed
olekssy wants to merge 1 commit into p-e-w:master from olekssy:feat/cpu-offload

Conversation


olekssy commented Apr 11, 2026

Summary

Adds a cpu_offload setting to allow enabling llm_int8_enable_fp32_cpu_offload in BitsAndBytesConfig. This prevents crashes when a quantized model exceeds available VRAM by allowing offloading to CPU/RAM.

Test plan

  • Verify that --cpu-offload CLI flag is recognized.
  • Verify that cpu_offload = true in config.toml is recognized.
  • Load a model that exceeds VRAM and confirm it loads successfully with the flag enabled.
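
The CLI check in the test plan can be sketched as follows. Only the `--cpu-offload` flag name and its `False` default come from this PR; the parser setup itself is illustrative, not the project's actual CLI code:

```python
import argparse

# Hypothetical CLI sketch: only the --cpu-offload flag name is taken from
# the test plan; the surrounding parser is an assumed minimal harness.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--cpu-offload",
    action="store_true",
    default=False,
    help="Enable FP32 CPU offload for quantized models.",
)

# argparse maps --cpu-offload to the attribute cpu_offload.
args = parser.parse_args(["--cpu-offload"])
print(args.cpu_offload)  # → True
```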

🤖 Generated with Claude Code

Adds the `cpu_offload` setting to allow enabling `llm_int8_enable_fp32_cpu_offload`
in BitsAndBytesConfig, which helps when the model exceeds available VRAM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist bot left a comment


Code Review

This pull request introduces a cpu_offload setting to enable FP32 CPU offloading for quantized models. However, the implementation uses a parameter specific to 8-bit quantization while the current configuration uses 4-bit, making the change ineffective. Furthermore, the PR includes unrelated refactoring of LoRA target module identification and unnecessary whitespace changes, which violate the repository's style guide regarding single-purpose PRs. The new configuration setting is also missing from the default configuration file.

```diff
 bnb_4bit_compute_dtype=compute_dtype,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_use_double_quant=True,
+llm_int8_enable_fp32_cpu_offload=self.settings.cpu_offload,
```


Severity: high

The llm_int8_enable_fp32_cpu_offload parameter is specific to 8-bit quantization (load_in_8bit=True). It has no effect when using 4-bit quantization (load_in_4bit=True), which is the only method currently supported by the QuantizationMethod enum. Consequently, this change does not actually enable CPU offloading for the models used in this application. For 4-bit models, offloading is typically handled via device_map and max_memory configurations.
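
For reference, steering 4-bit offloading via `device_map`/`max_memory` might look like the sketch below. The memory budgets are made-up values, and the actual `from_pretrained` call is left as a comment so the snippet stays self-contained:

```python
# Hypothetical sketch: for 4-bit models, offloading is controlled through
# device_map / max_memory rather than llm_int8_enable_fp32_cpu_offload.
# The budgets below are illustrative, not measured requirements.
max_memory = {
    0: "10GiB",      # cap usage on GPU 0; overflow layers spill to CPU
    "cpu": "30GiB",  # allow spillover into system RAM
}

load_kwargs = {
    "device_map": "auto",   # let accelerate place layers across devices
    "max_memory": max_memory,
    "load_in_4bit": True,   # normally supplied via BitsAndBytesConfig
}

# In real code these kwargs would be passed to
# AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs).
print(sorted(load_kwargs))
```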

Comment on lines +129 to +132
```diff
+cpu_offload: bool = Field(
+    default=False,
+    description="Whether to enable FP32 CPU offload for quantized models (suggested by transformers for VRAM issues).",
+)
```


Severity: medium

The new cpu_offload setting is missing from config.default.toml. According to the repository style guide, all new settings added to config.py must also be added to config.default.toml with their default value and description, maintaining the same relative order.

References
  1. When new settings are added in config.py, they should also be added to config.default.toml, set to their default value and with their description as a comment. The order of settings in config.default.toml should match that in config.py. (link)
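
Under that guideline, the fix would presumably be a matching entry in config.default.toml along these lines (placement and exact comment wording are assumptions, mirroring the Field description from the PR):

```toml
# Whether to enable FP32 CPU offload for quantized models (suggested by
# transformers for VRAM issues).
cpu_offload = false
```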

Comment on lines +172 to +184
```diff
+module_id_to_full_name = {
+    id(module): module_name
+    for module_name, module in self.model.named_modules()
+}
+
+for layer_index in range(len(self.get_layers())):
+    for modules in self.get_layer_modules(layer_index).values():
+        for module in modules:
-            if id(module) in module_id_to_leaf_name:
-                target_modules_set.add(module_id_to_leaf_name[id(module)])
+            full_name = module_id_to_full_name.get(id(module))
+            if full_name is not None:
+                target_modules_set.add(full_name)
+
-target_modules = list(target_modules_set)
+target_modules = sorted(target_modules_set)
```


Severity: medium

This refactoring of the LoRA target module identification (switching from leaf names to full module paths) is semantically independent of the CPU offloading feature. The repository style guide requires that pull requests implement only one change and avoid unrelated modifications to existing code. This should be moved to a separate PR.

References
  1. Pull requests should implement one change, and one change only. PRs must not change existing code unless the changes are directly related to the PR. (link)
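
The leaf-name versus full-path distinction the reviewer refers to can be illustrated with a toy stand-in for PyTorch's `named_modules` (the `Module` class, module names, and model shape below are all hypothetical, chosen only to mimic a transformer-style hierarchy):

```python
# Toy stand-in for torch.nn.Module.named_modules, illustrating leaf module
# names vs full module paths. Everything here is illustrative, not the
# project's actual code.
class Module:
    def __init__(self, **children):
        self._children = children

    def named_modules(self, prefix=""):
        yield prefix, self
        for name, child in self._children.items():
            full = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(full)

q_proj = Module()
attn = Module(q_proj=q_proj)
model = Module(layers=Module(**{"0": Module(self_attn=attn)}))

# Full paths uniquely identify a module; leaf names collapse modules that
# share the same final component (e.g. q_proj in every layer).
full_names = {id(m): name for name, m in model.named_modules()}
leaf_names = {id(m): name.rsplit(".", 1)[-1] for name, m in model.named_modules()}

print(leaf_names[id(q_proj)])  # → q_proj
print(full_names[id(q_proj)])  # → layers.0.self_attn.q_proj
```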

```diff
         skip_special_tokens=True,
     ),
 )
\ No newline at end of file
```


Severity: medium

This change in whitespace/newline at the end of the file is unrelated to the PR's objective and violates the repository style guide regarding unnecessary changes to existing code.

References
  1. PRs must not change existing code unless the changes are directly related to the PR. This includes changes to formatting and comments. (link)

olekssy closed this Apr 11, 2026
Author

olekssy commented Apr 11, 2026

PR closed due to incompatibility of the change with the quantization setup.
