A high-throughput and memory-efficient inference and serving engine for LLMs
Topics: amd, cuda, inference, pytorch, transformer, llama, gpt, rocm, model-serving, tpu, hpu, mlops, xpu, llm, inferentia, llmops, llm-serving, trainium
Updated Jan 8, 2025 - Python