A lightweight LLM inference toolkit focused on minimizing inference latency.
- CUDA Graphs: captures and replays kernel launch sequences to cut per-step launch overhead and reduce inference latency
- PagedAttention: block-based KV-cache management enabling efficient long-sequence inference
- Continuous batching: dynamically adds and removes requests from the running batch to improve throughput
- FlashAttention: IO-aware attention that reduces memory traffic for long sequences
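To illustrate the idea behind PagedAttention's KV-cache management, here is a minimal sketch (not osc-llm's actual implementation; `BLOCK_SIZE`, `BlockTable`, and all names are illustrative assumptions): logical token positions are mapped to fixed-size physical cache blocks, so memory is allocated on demand rather than reserved up front for the maximum sequence length.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


class BlockTable:
    """Toy PagedAttention-style block table for one sequence."""

    def __init__(self, num_physical_blocks=1024):
        self.free_blocks = list(range(num_physical_blocks))  # physical block ids
        self.blocks = []   # logical block index -> physical block id
        self.length = 0    # tokens cached so far

    def append_token(self):
        # Allocate a new physical block only when the current one is full
        if self.length % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop(0))
        self.length += 1

    def physical_slot(self, pos):
        # (physical block id, offset within block) for logical position pos
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

Because blocks are fixed-size and allocated lazily, fragmentation stays bounded and many sequences can share one physical pool.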
💡 All core functionality is built on osc-transformers; see that project for technical details.
- Install PyTorch
- Install flash-attn: recommended to use the official prebuilt wheel to avoid build issues
- Install osc-llm
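Since flash-attn builds often fail in mismatched environments, a quick sanity check that the prerequisites are importable can save debugging time. This is a generic sketch (the module names `torch`, `flash_attn`, and `osc_llm` are the usual import names, assumed here, not confirmed by this README):

```python
import importlib.util


def check_env(modules=("torch", "flash_attn", "osc_llm")):
    """Return {module_name: importable?} without actually importing anything heavy."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}


print(check_env())
```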
```bash
pip install osc-llm --upgrade
```

```python
from osc_llm import LLM, SamplingParams

# Initialize the model
llm = LLM("checkpoints/Qwen/Qwen3-0.6B", gpu_memory_utilization=0.5, device="cuda:0")

# Chat
messages = [
    {"role": "user", "content": "Hello! What's your name?"}
]
sampling_params = SamplingParams(temperature=0.5, top_p=0.95, top_k=40)
result = llm.chat(messages=messages, sampling_params=sampling_params, enable_thinking=True, stream=False)
print(result)

# Streaming generation
for token in llm.chat(messages=messages, sampling_params=sampling_params, enable_thinking=True, stream=True):
    print(token, end="", flush=True)
```

Supported models:
- Qwen3ForCausalLM
- Qwen2ForCausalLM
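The `temperature`, `top_p`, and `top_k` parameters passed to `SamplingParams` correspond to the standard sampling filters used by most inference engines. A minimal pure-Python sketch of how such filtering works (the function `sample_filter` is illustrative, not part of osc-llm's API):

```python
import math


def sample_filter(logits, temperature=0.5, top_k=40, top_p=0.95):
    """Apply temperature, top-k, and nucleus (top-p) filtering to raw logits,
    returning (token_index, probability) pairs over the surviving tokens."""
    # Temperature scaling: lower temperature sharpens the distribution
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest-scoring tokens
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the retained logits (max-subtracted for stability)
    mx = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - mx) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens into a proper distribution
    z = sum(p for _, p in kept)
    return [(i, p / z) for i, p in kept]
```

The engine would then draw the next token from this filtered distribution; lowering `top_p` or `top_k` makes generation more deterministic.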