Optimizing 14GB Model on 4GB VRAM by agentifyanchor · Pull Request #16 · FlashLabs-AI-Corp/FlashLabs-Chroma

agentifyanchor · 2026-01-24T19:54:34Z

Proposal: Low-VRAM Inference Script (4GB GPU Support)

Hi everyone! 👋

I managed to run Chroma-4B successfully on a consumer laptop GPU (RTX 3050 Ti 4GB) using 4-bit quantization (bitsandbytes) and careful memory offloading.

Performance:

RTF (Real-Time Factor): ~2.60x (on 4GB VRAM + 8GB RAM Offload)
Stability: Rock solid (no OOM crashes).

I realized many developers might be struggling with the 14GB VRAM requirement, so I created a clean, minimal "Walkie-Talkie" script to demonstrate how to run this locally.

Included Files:

local_voice_chat.py: A clean, light script for "Talk & Listen" interaction.
- Uses direct memory playback (no disk I/O latency).
- Robust fallback if no custom voice prompt is found.
local_voice_chat_with_telemetry.py: Adds performance metrics (TF, RTF, Input/Output Latency).

I would love to contribute these as examples under local_run to help the community access this amazing model on lower-end hardware.

Best regards,
Ilyes .M
A Fan of Chroma! 🚀

kaishen-Dotc · 2026-01-28T08:22:57Z

Hi @agentifyanchor, thanks for this excellent contribution! 🎉
The 4-bit quantization approach and memory offloading strategy look great, and this will definitely help users with limited VRAM.
Before we merge, could you please verify compatibility with transformers 5.0.0? We've noticed that different transformers versions may require adjustments in the modeling code (see Chroma-SGLang for reference which can work in the virsion of 4.57.1).
Test environment:

torch >= 2.7.0
transformers == 5.0.0
accelerate >= 1.7.0

If you encounter any issues with transformers 5.0.0, please let us know the specific errors so we can work on compatibility fixes together.
Thanks again for your contribution to the community!

updated requirements to reflect these tested versions

agentifyanchor

updated requirements.txt to reflect these tested versions

agentifyanchor · 2026-01-28T23:02:41Z

Hi @agentifyanchor, thanks for this excellent contribution! 🎉 The 4-bit quantization approach and memory offloading strategy look great, and this will definitely help users with limited VRAM. Before we merge, could you please verify compatibility with transformers 5.0.0? We've noticed that different transformers versions may require adjustments in the modeling code (see Chroma-SGLang for reference which can work in the virsion of 4.57.1). Test environment:

torch >= 2.7.0 transformers == 5.0.0 accelerate >= 1.7.0

If you encounter any issues with transformers 5.0.0, please let us know the specific errors so we can work on compatibility fixes together. Thanks again for your contribution to the community!

Hi @kaishen-Dotc, thank you for the feedback!

I have verified the compatibility with transformers 5.0.0 as requested. I run a test using those versions:

Transformers: 5.0.0rc0
Torch: 2.6.0+cu124 (Trusted on Latest Stable. 2.7.0 seems to be Nightly, so I stuck to Stable for reliability).
Accelerate: 1.12.0 (Verified > 1.7.0).

The model loads correctly with 4-bit quantization and inference works. I ran the full loop (ASR -> Text Generation -> Audio Generation).

I’ve also updated the requirements.txt in the PR to reflect the tested versions.

I attached a screenshot as ref to the local run.

Optimizing 14GB Model on 4GB VRAM

49b4b81

Update package versions in requirements.txt

97ee9c5

updated requirements to reflect these tested versions

agentifyanchor commented Jan 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizing 14GB Model on 4GB VRAM#16

Optimizing 14GB Model on 4GB VRAM#16
agentifyanchor wants to merge 2 commits intoFlashLabs-AI-Corp:mainfrom
agentifyanchor:main

agentifyanchor commented Jan 24, 2026 •

edited

Loading

Uh oh!

kaishen-Dotc commented Jan 28, 2026

Uh oh!

agentifyanchor left a comment

Uh oh!

agentifyanchor commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

agentifyanchor commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposal: Low-VRAM Inference Script (4GB GPU Support)

Included Files:

Uh oh!

kaishen-Dotc commented Jan 28, 2026

Uh oh!

agentifyanchor left a comment

Choose a reason for hiding this comment

Uh oh!

agentifyanchor commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

agentifyanchor commented Jan 24, 2026 •

edited

Loading