Skip to content

Optimizing 14GB Model on 4GB VRAM#16

Open
agentifyanchor wants to merge 2 commits intoFlashLabs-AI-Corp:mainfrom
agentifyanchor:main
Open

Optimizing 14GB Model on 4GB VRAM#16
agentifyanchor wants to merge 2 commits intoFlashLabs-AI-Corp:mainfrom
agentifyanchor:main

Conversation

@agentifyanchor
Copy link

@agentifyanchor agentifyanchor commented Jan 24, 2026

Proposal: Low-VRAM Inference Script (4GB GPU Support)

Hi everyone! 👋

I managed to run Chroma-4B successfully on a consumer laptop GPU (RTX 3050 Ti 4GB) using 4-bit quantization (bitsandbytes) and careful memory offloading.

Performance:

  • RTF (Real-Time Factor): ~2.60x (on 4GB VRAM + 8GB RAM Offload)
  • Stability: Rock solid (no OOM crashes).

I realized many developers might be struggling with the 14GB VRAM requirement, so I created a clean, minimal "Walkie-Talkie" script to demonstrate how to run this locally.

Included Files:

  1. local_voice_chat.py: A clean, light script for "Talk & Listen" interaction.
    • Uses direct memory playback (no disk I/O latency).
    • Robust fallback if no custom voice prompt is found.
  2. local_voice_chat_with_telemetry.py: Adds performance metrics (TF, RTF, Input/Output Latency).

I would love to contribute these as examples under local_run to help the community access this amazing model on lower-end hardware.

Best regards,
Ilyes .M
A Fan of Chroma! 🚀

@kaishen-Dotc
Copy link
Member

Hi @agentifyanchor, thanks for this excellent contribution! 🎉
The 4-bit quantization approach and memory offloading strategy look great, and this will definitely help users with limited VRAM.
Before we merge, could you please verify compatibility with transformers 5.0.0? We've noticed that different transformers versions may require adjustments in the modeling code (see Chroma-SGLang for reference which can work in the virsion of 4.57.1).
Test environment:

torch >= 2.7.0
transformers == 5.0.0
accelerate >= 1.7.0

If you encounter any issues with transformers 5.0.0, please let us know the specific errors so we can work on compatibility fixes together.
Thanks again for your contribution to the community!

updated requirements to reflect these tested versions
Copy link
Author

@agentifyanchor agentifyanchor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated requirements.txt to reflect these tested versions

@agentifyanchor
Copy link
Author

Hi @agentifyanchor, thanks for this excellent contribution! 🎉 The 4-bit quantization approach and memory offloading strategy look great, and this will definitely help users with limited VRAM. Before we merge, could you please verify compatibility with transformers 5.0.0? We've noticed that different transformers versions may require adjustments in the modeling code (see Chroma-SGLang for reference which can work in the virsion of 4.57.1). Test environment:

torch >= 2.7.0 transformers == 5.0.0 accelerate >= 1.7.0

If you encounter any issues with transformers 5.0.0, please let us know the specific errors so we can work on compatibility fixes together. Thanks again for your contribution to the community!

Hi @kaishen-Dotc, thank you for the feedback!

I have verified the compatibility with transformers 5.0.0 as requested. I run a test using those versions:

  • Transformers: 5.0.0rc0
  • Torch: 2.6.0+cu124 (Trusted on Latest Stable. 2.7.0 seems to be Nightly, so I stuck to Stable for reliability).
  • Accelerate: 1.12.0 (Verified > 1.7.0).

The model loads correctly with 4-bit quantization and inference works. I ran the full loop (ASR -> Text Generation -> Audio Generation).

I’ve also updated the requirements.txt in the PR to reflect the tested versions.

I attached a screenshot as ref to the local run.
tested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants