feat: add configurable CPU offloading for quantized models#294
feat: add configurable CPU offloading for quantized models#294olekssy wants to merge 1 commit intop-e-w:masterfrom
Conversation
Adds the `cpu_offload` setting to allow enabling `llm_int8_enable_fp32_cpu_offload` in BitsAndBytesConfig, which helps when the model exceeds available VRAM. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a cpu_offload setting to enable FP32 CPU offloading for quantized models. However, the implementation uses a parameter specific to 8-bit quantization while the current configuration uses 4-bit, making the change ineffective. Furthermore, the PR includes unrelated refactoring of LoRA target module identification and unnecessary whitespace changes, which violate the repository's style guide regarding single-purpose PRs. The new configuration setting is also missing from the default configuration file.
| bnb_4bit_compute_dtype=compute_dtype, | ||
| bnb_4bit_quant_type="nf4", | ||
| bnb_4bit_use_double_quant=True, | ||
| llm_int8_enable_fp32_cpu_offload=self.settings.cpu_offload, |
There was a problem hiding this comment.
The llm_int8_enable_fp32_cpu_offload parameter is specific to 8-bit quantization (load_in_8bit=True). It has no effect when using 4-bit quantization (load_in_4bit=True), which is the only method currently supported by the QuantizationMethod enum. Consequently, this change does not actually enable CPU offloading for the models used in this application. For 4-bit models, offloading is typically handled via device_map and max_memory configurations.
| cpu_offload: bool = Field( | ||
| default=False, | ||
| description="Whether to enable FP32 CPU offload for quantized models (suggested by transformers for VRAM issues).", | ||
| ) |
There was a problem hiding this comment.
The new cpu_offload setting is missing from config.default.toml. According to the repository style guide, all new settings added to config.py must also be added to config.default.toml with their default value and description, maintaining the same relative order.
References
- When new settings are added in config.py, they should also be added to config.default.toml, set to their default value and with their description as a comment. The order of settings in config.default.toml should match that in config.py. (link)
| module_id_to_full_name = { | ||
| id(module): module_name | ||
| for module_name, module in self.model.named_modules() | ||
| } | ||
|
|
||
| for layer_index in range(len(self.get_layers())): | ||
| for modules in self.get_layer_modules(layer_index).values(): | ||
| for module in modules: | ||
| if id(module) in module_id_to_leaf_name: | ||
| target_modules_set.add(module_id_to_leaf_name[id(module)]) | ||
| full_name = module_id_to_full_name.get(id(module)) | ||
| if full_name is not None: | ||
| target_modules_set.add(full_name) | ||
|
|
||
| target_modules = list(target_modules_set) | ||
| target_modules = sorted(target_modules_set) |
There was a problem hiding this comment.
This refactoring of the LoRA target module identification (switching from leaf names to full module paths) is semantically independent of the CPU offloading feature. The repository style guide requires that pull requests implement only one change and avoid unrelated modifications to existing code. This should be moved to a separate PR.
References
- Pull requests should implement one change, and one change only. PRs must not change existing code unless the changes are directly related to the PR. (link)
| skip_special_tokens=True, | ||
| ), | ||
| ) | ||
| ) No newline at end of file |
There was a problem hiding this comment.
This change in whitespace/newline at the end of the file is unrelated to the PR's objective and violates the repository style guide regarding unnecessary changes to existing code.
References
- PRs must not change existing code unless the changes are directly related to the PR. This includes changes to formatting and comments. (link)
|
PR closed due to incompatibility of the change with the quantization setup. |
Summary
Adds a
cpu_offloadsetting to allow enablingllm_int8_enable_fp32_cpu_offloadinBitsAndBytesConfig. This prevents crashes when a quantized model exceeds available VRAM by allowing offloading to CPU/RAM.Test plan
--cpu-offloadCLI flag is recognized.cpu_offload = trueinconfig.tomlis recognized.🤖 Generated with Claude Code