Summary
I believe I’ve found a regression introduced after commit 6db4465.
On an NVIDIA L4 (22GB) with Torch 2.10.0+cu128, newer commits fail to initialize the 5Hz LM with nano-vLLM because CUDA graph capture errors out. ACE-Step then falls back to the PyTorch backend, which later crashes during generation with:
RuntimeError: Offset increment outside graph capture encountered unexpectedly.
Pinning to commit 6db4465 (from ~2 days ago) restores correct behavior: nano-vLLM initializes successfully and generation works.
Environment
- GPU: NVIDIA L4 (22GB VRAM)
- Runtime: Google Colab
- GPU tier detected by ACE-Step: tier6b
- Torch: 2.10.0+cu128
- nano-vLLM installed via project setup
- LM model: acestep-5Hz-lm-1.7B
- DiT config: acestep-v15-turbo
More details
During Gradio startup, the log shows:
Initializing 5Hz LM with model: ...,
[nanovllm] KV cache allocated ...
❌ Error initializing 5Hz LM:
CUDA error: operation failed due to a previous error during capture
...
torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
which then yields:
WARNING Falling back to PyTorch backend
5Hz LM initialized successfully using PyTorch backend on cuda
and if I then go to generate a song:
RuntimeError: Offset increment outside graph capture encountered unexpectedly.
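For anyone debugging this, here is a minimal diagnostic sketch (not ACE-Step code, just a way to confirm the hypothesis) that checks whether the process is still considered to be capturing a CUDA graph after the failed nano-vLLM init, which would explain why the later RNG offset update is rejected:

```python
# Minimal diagnostic sketch (assumption: run in the same Python process after
# the failed nano-vLLM init, e.g. from a Colab cell or a pdb breakpoint).
import torch

# True here would mean the aborted CUDA graph capture was never torn down,
# which is consistent with "Offset increment outside graph capture".
print("current stream capturing:", torch.cuda.is_current_stream_capturing())

# A tiny RNG-consuming op on the default stream; if the CUDA RNG state is
# still marked as "in capture", this is where the offset error surfaces.
_ = torch.randn(8, device="cuda")
print("randn on cuda succeeded")
```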
What I tried
- Restarted Gradio / service multiple times
- Disabled torch.compile/dynamo via env vars at launch (see the first sketch after this list):
- TORCHDYNAMO_DISABLE=1
- TORCH_COMPILE_DISABLE=1
- TORCHINDUCTOR_DISABLE_CUDAGRAPHS=1
- Selected vLLM backend in UI
- Tried to force eager mode (enforce_eager=True) to disable CUDA graphs, but this did not resolve the issue: the logs still show enforce_eager: False, and nano-vLLM still attempts CUDA graph capture and fails (see the second sketch after this list).
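For reference, this is roughly how the env vars were set, as a sketch assuming the app is launched from Python; the variables have to be exported before torch/ACE-Step are imported to take effect:

```python
# Sketch: disable dynamo/compile/inductor CUDA graphs before anything imports torch.
import os

os.environ["TORCHDYNAMO_DISABLE"] = "1"
os.environ["TORCH_COMPILE_DISABLE"] = "1"
os.environ["TORCHINDUCTOR_DISABLE_CUDAGRAPHS"] = "1"

# ...then launch the Gradio app as usual (exact entry point omitted here).
```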
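And this is the kind of change I expected to take effect. A sketch of passing enforce_eager when constructing nano-vLLM directly; the import name, model path, and keyword placement are assumptions based on the logged "enforce_eager: False" config field, not a confirmed snippet of ACE-Step's code:

```python
# Sketch only: how enforce_eager would be passed if nano-vLLM were constructed
# directly. ACE-Step may set this option somewhere else internally.
from nanovllm import LLM  # assumed import name, matching the "[nanovllm]" log prefix

llm = LLM(
    "path/to/acestep-5Hz-lm-1.7B",  # illustrative model path
    enforce_eager=True,             # skip CUDA graph capture entirely
)
```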
To reproduce:
- Launch Gradio on Colab with NVIDIA L4
- Initialize service (DiT + LM enabled)
- Observe the nano-vLLM init failure → PyTorch fallback
- Generate a song → generation crash