di-osc/osc-llm


OSC-LLM

A lightweight LLM inference toolkit focused on minimizing inference latency.

Chinese README

Features

  • CUDA Graph: captures and replays GPU kernel launch sequences to cut per-token launch overhead and inference latency
  • PagedAttention: block-based KV-cache management that allocates memory on demand, enabling long-sequence inference
  • Continuous batching: dynamically adds and removes requests from the running batch to keep the GPU saturated
  • FlashAttention: memory-efficient exact attention for long sequences

💡 The core inference components are built on osc-transformers; see that project for technical details.
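To make the PagedAttention bullet above concrete, here is a toy sketch of block-based KV-cache management: the cache is divided into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand as the sequence grows rather than reserved up front. This is a simplified illustration of the general technique, not osc-llm's actual implementation; all names here are hypothetical.

```python
# Toy sketch of PagedAttention-style KV-cache block allocation.
# Illustration only; not osc-llm's actual implementation.
BLOCK_SIZE = 16  # tokens stored per cache block


class BlockAllocator:
    """Hands out physical cache blocks from a fixed pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        # A new physical block is allocated only when the current one fills up,
        # so memory use grows with the sequence instead of being pre-reserved.
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.length += 1


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):              # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))      # 3
```

When a sequence finishes, its blocks are returned to the pool via `release`, which is what lets many concurrent sequences share one cache.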

Installation

  • Install PyTorch
  • Install flash-attn: use an official prebuilt wheel to avoid build issues
  • Install osc-llm
pip install osc-llm --upgrade
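Putting the three steps together, the installation might look like the following. This assumes a CUDA environment; the CUDA index URL is an example, so pick the one matching your driver from the PyTorch install selector, and pick a flash-attn wheel matching your PyTorch/CUDA versions.

```shell
# 1. Install PyTorch with CUDA support (example index URL; choose yours).
pip install torch --index-url https://download.pytorch.org/whl/cu121

# 2. Install flash-attn; --no-build-isolation lets it see the installed torch,
#    and pip will use a prebuilt wheel when one matches your environment.
pip install flash-attn --no-build-isolation

# 3. Install osc-llm.
pip install osc-llm --upgrade
```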

Quick Start

Basic Usage

from osc_llm import LLM, SamplingParams

# Initialize the model
llm = LLM("checkpoints/Qwen/Qwen3-0.6B", gpu_memory_utilization=0.5, device="cuda:0")

# Chat
messages = [
    {"role": "user", "content": "Hello! What's your name?"}
]
sampling_params = SamplingParams(temperature=0.5, top_p=0.95, top_k=40)
result = llm.chat(messages=messages, sampling_params=sampling_params, enable_thinking=True, stream=False)
print(result)

# Streaming generation
for token in llm.chat(messages=messages, sampling_params=sampling_params, enable_thinking=True, stream=True):
    print(token, end="", flush=True)
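The `SamplingParams` fields used above (`temperature`, `top_p`, `top_k`) control how the next token is drawn from the model's output logits. A minimal pure-Python sketch of that sampling logic, to show what each knob does (an illustration of the concepts, not osc-llm's actual implementation):

```python
import math
import random


def sample_next_token(logits, temperature=0.5, top_p=0.95, top_k=40, rng=None):
    """Toy temperature / top-k / top-p (nucleus) sampling over raw logits.

    Illustrates what SamplingParams controls; not osc-llm's code.
    """
    rng = rng or random.Random()
    # Temperature scaling: values < 1 sharpen the distribution,
    # values > 1 flatten it.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens and sample one.
    norm = sum(p for _, p in kept)
    r = rng.random() * norm
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]


# With top_k=2, only the two highest-logit tokens (indices 0 and 1) survive.
token = sample_next_token([2.0, 1.0, 0.2, -1.0],
                          temperature=0.5, top_k=2, top_p=0.9)
```

Lowering `temperature` makes the highest-probability token dominate; tightening `top_p` or `top_k` truncates the tail of unlikely tokens before sampling.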

Supported Models

  • Qwen3ForCausalLM
  • Qwen2ForCausalLM
