diff --git a/README.md b/README.md index 647b9047f2..88e3abd07d 100644 --- a/README.md +++ b/README.md @@ -106,11 +106,11 @@ We can run the Llama-3 model with the chat completion Python API of MLC LLM. You can save the code below into a Python file and run it. ```python -from mlc_llm import LLMEngine +from mlc_llm import MLCEngine # Create engine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" -engine = LLMEngine(model) +engine = MLCEngine(model) # Run chat completion in OpenAI API. for response in engine.chat.completions.create( @@ -125,12 +125,12 @@ print("\n") engine.terminate() ``` -**The Python API of `mlc_llm.LLMEngine` fully aligns with OpenAI API**. -You can use LLMEngine in the same way of using +**The Python API of `mlc_llm.MLCEngine` fully aligns with OpenAI API**. +You can use MLCEngine in the same way you would use [OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage) for both synchronous and asynchronous generation. -If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncLLMEngine` instead. +If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncMLCEngine` instead. ### REST Server diff --git a/docs/deploy/python_engine.rst b/docs/deploy/python_engine.rst index cfbc3b5d4c..89c60ac422 100644 --- a/docs/deploy/python_engine.rst +++ b/docs/deploy/python_engine.rst @@ -4,7 +4,7 @@ Python API ========== .. note:: - This page introduces the Python API with LLMEngine in MLC LLM. + This page introduces the Python API with MLCEngine in MLC LLM. If you want to check out the old Python API which uses :class:`mlc_llm.ChatModule`, please go to :ref:`deploy-python-chat-module` @@ -13,10 +13,10 @@ Python API :depth: 2 -MLC LLM provides Python API through classes :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine` +MLC LLM provides Python API through classes :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine` which **support full OpenAI API completeness** for easy integration into other Python projects. -This page introduces how to use the LLM engines in MLC LLM. +This page introduces how to use the engines in MLC LLM. The Python API is a part of the MLC-LLM package, which we have prepared pre-built pip wheels via the :ref:`installation page <install-mlc-packages>`. @@ -26,31 +26,31 @@ Verify Installation .. code:: bash - python -c "from mlc_llm import LLMEngine; print(LLMEngine)" + python -c "from mlc_llm import MLCEngine; print(MLCEngine)" -You are expected to see the output of ``<class 'mlc_llm.serve.engine.LLMEngine'>``. +You are expected to see the output of ``<class 'mlc_llm.serve.engine.MLCEngine'>``. If the command above results in error, follow :ref:`install-mlc-packages` to install prebuilt pip packages or build MLC LLM from source. -Run LLMEngine +Run MLCEngine ------------- -:class:`mlc_llm.LLMEngine` provides the interface of OpenAI chat completion synchronously. -:class:`mlc_llm.LLMEngine` does not batch concurrent request due to the synchronous design, -and please use :ref:`AsyncLLMEngine <python-engine-async-llm-engine>` for request batching process. +:class:`mlc_llm.MLCEngine` provides the interface of OpenAI chat completion synchronously. +:class:`mlc_llm.MLCEngine` does not batch concurrent requests due to its synchronous design; +please use :ref:`AsyncMLCEngine <python-engine-async-llm-engine>` when you need to batch requests. **Stream Response.** In :ref:`quick-start` and :ref:`introduction-to-mlc-llm`, -we introduced the basic use of :class:`mlc_llm.LLMEngine`. +we introduced the basic use of :class:`mlc_llm.MLCEngine`. ..
code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine # Create engine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" - engine = LLMEngine(model) + engine = MLCEngine(model) # Run chat completion in OpenAI API. for response in engine.chat.completions.create( @@ -64,9 +64,9 @@ we introduced the basic use of :class:`mlc_llm.LLMEngine`. engine.terminate() -This code example first creates an :class:`mlc_llm.LLMEngine` instance with the 8B Llama-3 model. -**We design the Python API** :class:`mlc_llm.LLMEngine` **to align with OpenAI API**, -which means you can use :class:`mlc_llm.LLMEngine` in the same way of using +This code example first creates an :class:`mlc_llm.MLCEngine` instance with the 8B Llama-3 model. +**We design the Python API** :class:`mlc_llm.MLCEngine` **to align with OpenAI API**, +which means you can use :class:`mlc_llm.MLCEngine` in the same way you would use `OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_ for both synchronous and asynchronous generation. @@ -90,14 +90,14 @@ for the complete chat completion interface. .. _python-engine-async-llm-engine: -Run AsyncLLMEngine +Run AsyncMLCEngine ------------------ -:class:`mlc_llm.AsyncLLMEngine` provides the interface of OpenAI chat completion with +:class:`mlc_llm.AsyncMLCEngine` provides the interface of OpenAI chat completion with asynchronous features. -**We recommend using** :class:`mlc_llm.AsyncLLMEngine` **to batch concurrent request for better throughput.** +**We recommend using** :class:`mlc_llm.AsyncMLCEngine` **to batch concurrent requests for better throughput.** -**Stream Response.** The core use of :class:`mlc_llm.AsyncLLMEngine` for stream responses is as follows. +**Stream Response.** The core use of :class:`mlc_llm.AsyncMLCEngine` for stream responses is as follows. .. code:: python @@ -109,14 +109,14 @@ asynchronous features. for choice in response.choices: print(choice.delta.content, end="", flush=True) -.. collapse:: The collapsed is a complete runnable example of AsyncLLMEngine in Python. +.. collapse:: A complete runnable example of AsyncMLCEngine in Python. .. code:: python import asyncio from typing import Dict - from mlc_llm.serve import AsyncLLMEngine + from mlc_llm.serve import AsyncMLCEngine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" prompts = [ @@ -127,7 +127,7 @@ asynchronous features. async def test_completion(): # Create engine - async_engine = AsyncLLMEngine(model=model) + async_engine = AsyncMLCEngine(model=model) num_requests = len(prompts) output_texts: Dict[str, str] = {} @@ -176,8 +176,8 @@ for the complete chat completion interface. Engine Mode ----------- -To ease the engine configuration, the constructors of :class:`mlc_llm.LLMEngine` and -:class:`mlc_llm.AsyncLLMEngine` have an optional argument ``mode``, +To ease the engine configuration, the constructors of :class:`mlc_llm.MLCEngine` and +:class:`mlc_llm.AsyncMLCEngine` have an optional argument ``mode``, which falls into one of the three options ``"local"``, ``"interactive"`` or ``"server"``. The default mode is ``"local"``. @@ -203,34 +203,34 @@ Deploy Your Own Model with Python API The :ref:`introduction page <introduction-to-mlc-llm>` introduces how we can deploy our own models with MLC LLM. This section introduces how you can use the model weights you convert and the model library you build -in :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine`. +in :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine`. We use the `Phi-2 <https://huggingface.co/microsoft/phi-2>`_ as the example model.
**Specify Model Weight Path.** Assume you have converted the model weights for your own model, -you can construct a :class:`mlc_llm.LLMEngine` as follows: +you can construct a :class:`mlc_llm.MLCEngine` as follows: .. code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine model = "models/phi-2" # Assuming the converted phi-2 model weights are under "models/phi-2" - engine = LLMEngine(model) + engine = MLCEngine(model) **Specify Model Library Path.** Further, if you build the model library on your own, -you can use it in :class:`mlc_llm.LLMEngine` by passing the library path through argument ``model_lib_path``. +you can use it in :class:`mlc_llm.MLCEngine` by passing the library path through argument ``model_lib_path``. .. code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine model = "models/phi-2" model_lib_path = "models/phi-2/lib.so" # Assuming the phi-2 model library is built at "models/phi-2/lib.so" - engine = LLMEngine(model, model_lib_path=model_lib_path) + engine = MLCEngine(model, model_lib_path=model_lib_path) -The same applies to :class:`mlc_llm.AsyncLLMEngine`. +The same applies to :class:`mlc_llm.AsyncMLCEngine`. .. _python-engine-api-reference: API Reference ------------- -The :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine` classes provide the following constructors. +The :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine` classes provide the following constructors. -The LLMEngine and AsyncLLMEngine have full OpenAI API completeness. +The MLCEngine and AsyncMLCEngine have full OpenAI API completeness. Please refer to `OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_ and `OpenAI chat completion API <https://platform.openai.com/docs/api-reference/chat>`_ for the complete chat completion interface. .. currentmodule:: mlc_llm -.. autoclass:: LLMEngine +.. autoclass:: MLCEngine :members: :exclude-members: evaluate :undoc-members: .. automethod:: __init__ -.. autoclass:: AsyncLLMEngine +.. autoclass:: AsyncMLCEngine :members: :exclude-members: evaluate :undoc-members: diff --git a/docs/get_started/introduction.rst b/docs/get_started/introduction.rst index 32bcfc4cdb..29060d5a60 100644 --- a/docs/get_started/introduction.rst +++ b/docs/get_started/introduction.rst @@ -90,11 +90,11 @@ You can save the code below into a Python file and run it. .. code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine # Create engine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" - engine = LLMEngine(model) + engine = MLCEngine(model) # Run chat completion in OpenAI API. for response in engine.chat.completions.create( @@ -114,9 +114,9 @@ You can save the code below into a Python file and run it. MLC LLM Python API -This code example first creates an :class:`mlc_llm.LLMEngine` instance with the 4-bit quantized Llama-3 model. -**We design the Python API** :class:`mlc_llm.LLMEngine` **to align with OpenAI API**, -which means you can use :class:`mlc_llm.LLMEngine` in the same way of using +This code example first creates an :class:`mlc_llm.MLCEngine` instance with the 4-bit quantized Llama-3 model. +**We design the Python API** :class:`mlc_llm.MLCEngine` **to align with OpenAI API**, +which means you can use :class:`mlc_llm.MLCEngine` in the same way you would use `OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_ for both synchronous and asynchronous generation.
@@ -134,7 +134,7 @@ If you want to run without streaming, you can run print(response) You can also try different arguments supported in `OpenAI chat completion API `_. -If you would like to do concurrent asynchronous generation, you can use :class:`mlc_llm.AsyncLLMEngine` instead. +If you would like to do concurrent asynchronous generation, you can use :class:`mlc_llm.AsyncMLCEngine` instead. REST Server ----------- @@ -229,7 +229,7 @@ You can also use this model in Python API, MLC serve and other use scenarios. (Optional) Compile Model Library ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In previous sections, model libraries are compiled when the :class:`mlc_llm.LLMEngine` launches, +In previous sections, model libraries are compiled when the :class:`mlc_llm.MLCEngine` launches, which is what we call "JIT (Just-in-Time) model compilation". In some cases, it is beneficial to explicitly compile the model libraries. We can deploy LLMs with reduced dependencies by shipping the library for deployment without going through compilation. @@ -257,12 +257,12 @@ At runtime, we need to specify this model library path to use it. For example, .. code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine # For Python API model = "models/phi-2" model_lib_path = "models/phi-2/lib.so" - engine = LLMEngine(model, model_lib_path=model_lib_path) + engine = MLCEngine(model, model_lib_path=model_lib_path) :ref:`compile-model-libraries` introduces the model compilation command in detail, where you can find instructions and example commands to compile model to different diff --git a/docs/get_started/quick_start.rst b/docs/get_started/quick_start.rst index 76d971275b..8349197eda 100644 --- a/docs/get_started/quick_start.rst +++ b/docs/get_started/quick_start.rst @@ -20,11 +20,11 @@ It is recommended to have at least 6GB free VRAM to run it. .. code:: python - from mlc_llm import LLMEngine + from mlc_llm import MLCEngine # Create engine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" - engine = LLMEngine(model) + engine = MLCEngine(model) # Run chat completion in OpenAI API. for response in engine.chat.completions.create( diff --git a/examples/python/sample_mlc_engine.py b/examples/python/sample_mlc_engine.py index f76e44c620..e4f869930f 100644 --- a/examples/python/sample_mlc_engine.py +++ b/examples/python/sample_mlc_engine.py @@ -1,8 +1,8 @@ -from mlc_llm import LLMEngine +from mlc_llm import MLCEngine # Create engine model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC" -engine = LLMEngine(model) +engine = MLCEngine(model) # Run chat completion in OpenAI API. for response in engine.chat.completions.create( diff --git a/python/mlc_llm/__init__.py b/python/mlc_llm/__init__.py index 8e3aaaa808..4843c6766d 100644 --- a/python/mlc_llm/__init__.py +++ b/python/mlc_llm/__init__.py @@ -6,4 +6,4 @@ from . import protocol, serve from .chat_module import ChatConfig, ChatModule, ConvConfig, GenerationConfig from .libinfo import __version__ -from .serve import AsyncLLMEngine, LLMEngine +from .serve import AsyncMLCEngine, MLCEngine diff --git a/python/mlc_llm/help.py b/python/mlc_llm/help.py index eff6f6f46e..14e5cee321 100644 --- a/python/mlc_llm/help.py +++ b/python/mlc_llm/help.py @@ -203,7 +203,7 @@ The number of draft tokens to generate in speculative proposal. The default values is 4. """, "engine_config_serve": """ -The LLMEngine execution configuration. +The MLCEngine execution configuration. Currently speculative decoding mode is specified via engine config. 
For example, you can use "--engine-config='spec_draft_length=4;speculative_mode=EAGLE'" to specify the eagle-style speculative decoding. diff --git a/python/mlc_llm/interface/serve.py b/python/mlc_llm/interface/serve.py index c5696ef473..d0cbd4690b 100644 --- a/python/mlc_llm/interface/serve.py +++ b/python/mlc_llm/interface/serve.py @@ -35,7 +35,7 @@ def serve( ): # pylint: disable=too-many-arguments, too-many-locals """Serve the model with the specified configuration.""" # Create engine and start the background loop - async_engine = engine.AsyncLLMEngine( + async_engine = engine.AsyncMLCEngine( model=model, device=device, model_lib_path=model_lib_path, diff --git a/python/mlc_llm/serve/__init__.py b/python/mlc_llm/serve/__init__.py index 79caff7cad..59358c1646 100644 --- a/python/mlc_llm/serve/__init__.py +++ b/python/mlc_llm/serve/__init__.py @@ -4,7 +4,7 @@ from .. import base from .config import EngineConfig, GenerationConfig, SpeculativeMode from .data import Data, ImageData, RequestStreamOutput, TextData, TokenData -from .engine import AsyncLLMEngine, LLMEngine +from .engine import AsyncMLCEngine, MLCEngine from .grammar import BNFGrammar, GrammarStateMatcher from .radix_tree import PagedRadixTree from .request import Request diff --git a/python/mlc_llm/serve/config.py b/python/mlc_llm/serve/config.py index 773a00625e..60e4eca8c5 100644 --- a/python/mlc_llm/serve/config.py +++ b/python/mlc_llm/serve/config.py @@ -141,7 +141,7 @@ class SpeculativeMode(enum.IntEnum): @tvm._ffi.register_object("mlc.serve.EngineConfig") # pylint: disable=protected-access class EngineConfig(tvm.runtime.Object): - """The class of LLMEngine execution configuration. + """The class of MLCEngine execution configuration. Parameters ---------- diff --git a/python/mlc_llm/serve/engine.py b/python/mlc_llm/serve/engine.py index febf88e99e..d9721b4864 100644 --- a/python/mlc_llm/serve/engine.py +++ b/python/mlc_llm/serve/engine.py @@ -37,10 +37,10 @@ class Chat: # pylint: disable=too-few-public-methods """The proxy class to direct to chat completions.""" def __init__(self, engine: weakref.ReferenceType) -> None: - assert isinstance(engine(), (AsyncLLMEngine, LLMEngine)) + assert isinstance(engine(), (AsyncMLCEngine, MLCEngine)) self.completions = ( AsyncChatCompletion(engine) # type: ignore - if isinstance(engine(), AsyncLLMEngine) + if isinstance(engine(), AsyncMLCEngine) else ChatCompletion(engine) # type: ignore ) @@ -49,7 +49,7 @@ class AsyncChatCompletion: # pylint: disable=too-few-public-methods """The proxy class to direct to async chat completions.""" if sys.version_info >= (3, 9): - engine: weakref.ReferenceType["AsyncLLMEngine"] + engine: weakref.ReferenceType["AsyncMLCEngine"] else: engine: weakref.ReferenceType @@ -226,7 +226,7 @@ class ChatCompletion: # pylint: disable=too-few-public-methods """The proxy class to direct to chat completions.""" if sys.version_info >= (3, 9): - engine: weakref.ReferenceType["LLMEngine"] + engine: weakref.ReferenceType["MLCEngine"] else: engine: weakref.ReferenceType @@ -401,7 +401,7 @@ class AsyncCompletion: # pylint: disable=too-few-public-methods """The proxy class to direct to async completions.""" if sys.version_info >= (3, 9): - engine: weakref.ReferenceType["AsyncLLMEngine"] + engine: weakref.ReferenceType["AsyncMLCEngine"] else: engine: weakref.ReferenceType @@ -580,7 +580,7 @@ class Completion: # pylint: disable=too-few-public-methods """The proxy class to direct to completions.""" if sys.version_info >= (3, 9): - engine: weakref.ReferenceType["LLMEngine"] + 
engine: weakref.ReferenceType["MLCEngine"] else: engine: weakref.ReferenceType @@ -752,8 +752,8 @@ def create( # pylint: disable=too-many-arguments,too-many-locals ) -class AsyncLLMEngine(engine_base.LLMEngineBase): - """The AsyncLLMEngine in MLC LLM that provides the asynchronous +class AsyncMLCEngine(engine_base.MLCEngineBase): + """The AsyncMLCEngine in MLC LLM that provides the asynchronous interfaces with regard to OpenAI API. Parameters @@ -825,7 +825,7 @@ class AsyncLLMEngine(engine_base.LLMEngineBase): memory usage may be slightly larger than this number. engine_config : Optional[EngineConfig] - The LLMEngine execution configuration. + The MLCEngine execution configuration. Currently speculative decoding mode is specified via engine config. For example, you can use "--engine-config='spec_draft_length=4;speculative_mode=EAGLE'" to specify the eagle-style speculative decoding. @@ -1228,7 +1228,7 @@ async def _generate( generation_config: GenerationConfig, request_id: str, ) -> AsyncGenerator[List[engine_base.CallbackStreamOutput], Any]: - """Internal asynchronous text generation interface of AsyncLLMEngine. + """Internal asynchronous text generation interface of AsyncMLCEngine. The method is a coroutine that streams a list of CallbackStreamOutput at a time via yield. The returned list length is the number of parallel generations specified by `generation_config.n`. @@ -1298,8 +1298,8 @@ def _abort(self, request_id: str): self._ffi["abort_request"](request_id) -class LLMEngine(engine_base.LLMEngineBase): - """The LLMEngine in MLC LLM that provides the synchronous +class MLCEngine(engine_base.MLCEngineBase): + """The MLCEngine in MLC LLM that provides the synchronous interfaces with regard to OpenAI API. Parameters @@ -1371,7 +1371,7 @@ class LLMEngine(engine_base.LLMEngineBase): memory usage may be slightly larger than this number. engine_config : Optional[EngineConfig] - The LLMEngine execution configuration. + The MLCEngine execution configuration. Currently speculative decoding mode is specified via engine config. For example, you can use "--engine-config='spec_draft_length=4;speculative_mode=EAGLE'" to specify the eagle-style speculative decoding. @@ -1767,7 +1767,7 @@ def _generate( # pylint: disable=too-many-locals generation_config: GenerationConfig, request_id: str, ) -> Iterator[List[engine_base.CallbackStreamOutput]]: - """Internal synchronous text generation interface of AsyncLLMEngine. + """Internal synchronous text generation interface of AsyncMLCEngine. The method is a coroutine that streams a list of CallbackStreamOutput at a time via yield. The returned list length is the number of parallel generations specified by `generation_config.n`. 
@@ -1821,7 +1821,7 @@ def _generate( # pylint: disable=too-many-locals def _request_stream_callback_impl( self, delta_outputs: List[data.RequestStreamOutput] ) -> List[List[engine_base.CallbackStreamOutput]]: - """The underlying implementation of request stream callback of LLMEngine.""" + """The underlying implementation of request stream callback of MLCEngine.""" batch_outputs: List[List[engine_base.CallbackStreamOutput]] = [] for delta_output in delta_outputs: request_id, stream_outputs = delta_output.unpack() diff --git a/python/mlc_llm/serve/engine_base.py b/python/mlc_llm/serve/engine_base.py index 6d89d223d1..7b2ede60b2 100644 --- a/python/mlc_llm/serve/engine_base.py +++ b/python/mlc_llm/serve/engine_base.py @@ -464,7 +464,7 @@ def infer_args_under_mode( @dataclass class CallbackStreamOutput: - """The output of LLMEngine._generate and AsyncLLMEngine._generate + """The output of MLCEngine._generate and AsyncMLCEngine._generate Attributes ---------- @@ -489,7 +489,7 @@ class CallbackStreamOutput: class AsyncRequestStream: - """The asynchronous stream for requests in AsyncLLMEngine. + """The asynchronous stream for requests in AsyncMLCEngine. Each request has its own unique stream. The stream exposes the method `push` for engine to push new generated @@ -548,29 +548,29 @@ async def __anext__(self) -> List[CallbackStreamOutput]: class EngineState: """The engine states that the request stream callback function may use. - This class is used for both AsyncLLMEngine and LLMEngine. - AsyncLLMEngine uses the fields and methods starting with "async", - and LLMEngine uses the ones starting with "sync". + This class is used for both AsyncMLCEngine and MLCEngine. + AsyncMLCEngine uses the fields and methods starting with "async", + and MLCEngine uses the ones starting with "sync". - - For AsyncLLMEngine, the state contains an asynchronous event loop, + - For AsyncMLCEngine, the state contains an asynchronous event loop, the streamers and the number of unfinished generations for each request being processed. - - For LLMEngine, the state contains a callback output blocking queue, + - For MLCEngine, the state contains a callback output blocking queue, the text streamers and the number of unfinished requests. We use this state class to avoid the callback function from capturing - the AsyncLLMEngine. + the AsyncMLCEngine. The state also optionally maintains an event trace recorder, which can provide Chrome tracing when enabled. """ trace_recorder = None - # States used for AsyncLLMEngine + # States used for AsyncMLCEngine async_event_loop: Optional[asyncio.AbstractEventLoop] = None async_streamers: Dict[str, Tuple[AsyncRequestStream, List[TextStreamer]]] = {} async_num_unfinished_generations: Dict[str, int] = {} - # States used for LLMEngine + # States used for MLCEngine sync_output_queue: queue.Queue = queue.Queue() sync_text_streamers: List[TextStreamer] = [] sync_num_unfinished_generations: int = 0 @@ -632,7 +632,7 @@ def async_lazy_init_event_loop(self) -> None: self.async_event_loop = asyncio.get_event_loop() def _async_request_stream_callback(self, delta_outputs: List[data.RequestStreamOutput]) -> None: - """The request stream callback function for AsyncLLMEngine to stream back + """The request stream callback function for AsyncMLCEngine to stream back the request generation results. 
Note @@ -652,7 +652,7 @@ def _async_request_stream_callback(self, delta_outputs: List[data.RequestStreamO def _async_request_stream_callback_impl( self, delta_outputs: List[data.RequestStreamOutput] ) -> None: - """The underlying implementation of request stream callback for AsyncLLMEngine.""" + """The underlying implementation of request stream callback for AsyncMLCEngine.""" for delta_output in delta_outputs: request_id, stream_outputs = delta_output.unpack() streamers = self.async_streamers.get(request_id, None) @@ -693,28 +693,28 @@ def _async_request_stream_callback_impl( self.record_event(request_id, event="finish callback") def _sync_request_stream_callback(self, delta_outputs: List[data.RequestStreamOutput]) -> None: - """The request stream callback function for LLMEngine to stream back + """The request stream callback function for MLCEngine to stream back the request generation results. """ # Put the delta outputs to the queue in the unblocking way. self.sync_output_queue.put_nowait(delta_outputs) -class LLMEngineBase: # pylint: disable=too-many-instance-attributes,too-few-public-methods +class MLCEngineBase: # pylint: disable=too-many-instance-attributes,too-few-public-methods """The base engine class, which implements common functions that - are shared by LLMEngine and AsyncLLMEngine. + are shared by MLCEngine and AsyncMLCEngine. This class wraps a threaded engine that runs on a standalone thread inside and streams back the delta generated results via callback functions. The internal threaded engine keeps running an loop that drives the engine. - LLMEngine and AsyncLLMEngine inherits this LLMEngineBase class, and implements + MLCEngine and AsyncMLCEngine inherits this MLCEngineBase class, and implements their own methods to process the delta generated results received from callback functions and yield the processed delta results in the forms of standard API protocols. - Checkout subclasses AsyncLLMEngine/LLMEngine for the docstring of constructor parameters. + Checkout subclasses AsyncMLCEngine/MLCEngine for the docstring of constructor parameters. 
""" def __init__( # pylint: disable=too-many-arguments,too-many-locals diff --git a/python/mlc_llm/serve/server/server_context.py b/python/mlc_llm/serve/server/server_context.py index 46b841aaa9..d6acd4a2be 100644 --- a/python/mlc_llm/serve/server/server_context.py +++ b/python/mlc_llm/serve/server/server_context.py @@ -2,7 +2,7 @@ from typing import Dict, List, Optional -from ..engine import AsyncLLMEngine +from ..engine import AsyncMLCEngine class ServerContext: @@ -13,7 +13,7 @@ class ServerContext: server_context: Optional["ServerContext"] = None def __init__(self): - self._models: Dict[str, AsyncLLMEngine] = {} + self._models: Dict[str, AsyncMLCEngine] = {} def __enter__(self): if ServerContext.server_context is not None: @@ -31,13 +31,13 @@ def current(): """Returns the current ServerContext.""" return ServerContext.server_context - def add_model(self, hosted_model: str, engine: AsyncLLMEngine) -> None: + def add_model(self, hosted_model: str, engine: AsyncMLCEngine) -> None: """Add a new model to the server context together with the engine.""" if hosted_model in self._models: raise RuntimeError(f"Model {hosted_model} already running.") self._models[hosted_model] = engine - def get_engine(self, model: Optional[str]) -> Optional[AsyncLLMEngine]: + def get_engine(self, model: Optional[str]) -> Optional[AsyncMLCEngine]: """Get the async engine of the requested model, or the unique async engine if only one engine is served.""" if len(self._models) == 1: diff --git a/python/mlc_llm/serve/sync_engine.py b/python/mlc_llm/serve/sync_engine.py index 23b151d5c7..257338da3a 100644 --- a/python/mlc_llm/serve/sync_engine.py +++ b/python/mlc_llm/serve/sync_engine.py @@ -41,7 +41,7 @@ def _create_tvm_module( return {key: module[key] for key in ffi_funcs} -class SyncLLMEngine: +class SyncMLCEngine: """The Python interface of synchronize request serving engine for MLC LLM. The engine receives requests from the "add_request" method. 
For diff --git a/tests/python/serve/evaluate_engine.py b/tests/python/serve/evaluate_engine.py index 4e541b7437..c89a9e2c38 100644 --- a/tests/python/serve/evaluate_engine.py +++ b/tests/python/serve/evaluate_engine.py @@ -5,7 +5,7 @@ from typing import List, Tuple from mlc_llm.serve import GenerationConfig -from mlc_llm.serve.sync_engine import SyncLLMEngine +from mlc_llm.serve.sync_engine import SyncMLCEngine def _parse_args(): @@ -41,7 +41,7 @@ def benchmark(args: argparse.Namespace): random.seed(args.seed) # Create engine - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=args.model, device=args.device, model_lib_path=args.model_lib_path, diff --git a/tests/python/serve/test_serve_async_engine.py b/tests/python/serve/test_serve_async_engine.py index 9bece30578..6e3835238a 100644 --- a/tests/python/serve/test_serve_async_engine.py +++ b/tests/python/serve/test_serve_async_engine.py @@ -3,7 +3,7 @@ import asyncio from typing import List -from mlc_llm.serve import AsyncLLMEngine, GenerationConfig +from mlc_llm.serve import AsyncMLCEngine, GenerationConfig prompts = [ "What is the meaning of life?", @@ -23,7 +23,7 @@ async def test_engine_generate(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -39,7 +39,7 @@ async def test_engine_generate(): ] async def generate_task( - async_engine: AsyncLLMEngine, + async_engine: AsyncMLCEngine, prompt: str, generation_cfg: GenerationConfig, request_id: str, @@ -80,7 +80,7 @@ async def test_chat_completion(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -132,7 +132,7 @@ async def test_chat_completion_non_stream(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -183,7 +183,7 @@ async def test_completion(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -235,7 +235,7 @@ async def test_completion_non_stream(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", diff --git a/tests/python/serve/test_serve_async_engine_spec.py b/tests/python/serve/test_serve_async_engine_spec.py index de91c845b3..c3963af613 100644 --- a/tests/python/serve/test_serve_async_engine_spec.py +++ b/tests/python/serve/test_serve_async_engine_spec.py @@ -3,7 +3,7 @@ import asyncio from typing import List -from mlc_llm.serve import AsyncLLMEngine, GenerationConfig, SpeculativeMode +from mlc_llm.serve import AsyncMLCEngine, GenerationConfig, SpeculativeMode prompts = [ "What is the meaning of life?", @@ -27,7 +27,7 @@ async def 
test_engine_generate(): small_model_lib_path = ( "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so" ) - async_engine = AsyncLLMEngine( + async_engine = AsyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -44,7 +44,7 @@ async def test_engine_generate(): ] async def generate_task( - async_engine: AsyncLLMEngine, + async_engine: AsyncMLCEngine, prompt: str, generation_cfg: GenerationConfig, request_id: str, diff --git a/tests/python/serve/test_serve_engine.py b/tests/python/serve/test_serve_engine.py index 330bd4cf82..f965e8cc82 100644 --- a/tests/python/serve/test_serve_engine.py +++ b/tests/python/serve/test_serve_engine.py @@ -2,7 +2,7 @@ # pylint: disable=too-many-arguments,too-many-locals,unused-argument,unused-variable from typing import List -from mlc_llm.serve import GenerationConfig, LLMEngine +from mlc_llm.serve import GenerationConfig, MLCEngine prompts = [ "What is the meaning of life?", @@ -22,7 +22,7 @@ def test_engine_generate(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = LLMEngine( + engine = MLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -61,7 +61,7 @@ def test_chat_completion(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = LLMEngine( + engine = MLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -105,7 +105,7 @@ def test_chat_completion_non_stream(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = LLMEngine( + engine = MLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -148,7 +148,7 @@ def test_completion(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = LLMEngine( + engine = MLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -192,7 +192,7 @@ def test_completion_non_stream(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = LLMEngine( + engine = MLCEngine( model=model, model_lib_path=model_lib_path, mode="server", diff --git a/tests/python/serve/test_serve_engine_grammar.py b/tests/python/serve/test_serve_engine_grammar.py index 7f2a33b230..b764c62cd2 100644 --- a/tests/python/serve/test_serve_engine_grammar.py +++ b/tests/python/serve/test_serve_engine_grammar.py @@ -7,9 +7,9 @@ import pytest from pydantic import BaseModel -from mlc_llm.serve import AsyncLLMEngine, GenerationConfig +from mlc_llm.serve import AsyncMLCEngine, GenerationConfig from mlc_llm.serve.config import ResponseFormat -from mlc_llm.serve.sync_engine import SyncLLMEngine +from mlc_llm.serve.sync_engine import SyncMLCEngine prompts_list = [ "Generate a JSON string containing 20 objects:", @@ -22,7 +22,7 @@ def test_batch_generation_with_grammar(): # Create engine - engine = SyncLLMEngine(model=model_path, model_lib_path=model_lib_path, mode="server") + engine = SyncMLCEngine(model=model_path, model_lib_path=model_lib_path, mode="server") prompt_len = len(prompts_list) prompts = prompts_list * 3 @@ -69,7 +69,7 @@ def test_batch_generation_with_grammar(): def test_batch_generation_with_schema(): # Create 
engine - engine = SyncLLMEngine(model=model_path, model_lib_path=model_lib_path, mode="server") + engine = SyncMLCEngine(model=model_path, model_lib_path=model_lib_path, mode="server") prompt = ( "Generate a json containing three fields: an integer field named size, a " @@ -121,7 +121,7 @@ class Schema(BaseModel): async def run_async_engine(): # Create engine - async_engine = AsyncLLMEngine(model=model_path, model_lib_path=model_lib_path, mode="server") + async_engine = AsyncMLCEngine(model=model_path, model_lib_path=model_lib_path, mode="server") prompts = prompts_list * 20 @@ -142,7 +142,7 @@ async def run_async_engine(): ] async def generate_task( - async_engine: AsyncLLMEngine, + async_engine: AsyncMLCEngine, prompt: str, generation_cfg: GenerationConfig, request_id: str, diff --git a/tests/python/serve/test_serve_engine_image.py b/tests/python/serve/test_serve_engine_image.py index ff64e7235b..59e8c97196 100644 --- a/tests/python/serve/test_serve_engine_image.py +++ b/tests/python/serve/test_serve_engine_image.py @@ -2,7 +2,7 @@ from pathlib import Path from mlc_llm.serve import GenerationConfig, data -from mlc_llm.serve.sync_engine import SyncLLMEngine +from mlc_llm.serve.sync_engine import SyncMLCEngine def get_test_image(config) -> data.ImageData: @@ -13,7 +13,7 @@ def test_engine_generate(): # Create engine model = "dist/llava-1.5-7b-hf-q4f16_1-MLC/params" model_lib_path = "dist/llava-1.5-7b-hf-q4f16_1-MLC/llava-1.5-7b-hf-q4f16_1-MLC.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", diff --git a/tests/python/serve/test_serve_engine_spec.py b/tests/python/serve/test_serve_engine_spec.py index 6647c7af19..33c06b1c5e 100644 --- a/tests/python/serve/test_serve_engine_spec.py +++ b/tests/python/serve/test_serve_engine_spec.py @@ -11,7 +11,7 @@ SpeculativeMode, data, ) -from mlc_llm.serve.sync_engine import SyncLLMEngine +from mlc_llm.serve.sync_engine import SyncMLCEngine prompts = [ "What is the meaning of life?", @@ -90,7 +90,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): small_model_lib_path = ( "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so" ) - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -158,7 +158,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): small_model_lib_path = ( "dist/Eagle-llama2-7b-chat-q0f16-MLC/Eagle-llama2-7b-chat-q0f16-MLC-cuda.so" ) - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -242,7 +242,7 @@ def step(self) -> None: "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so" ) timer = CallbackTimer() - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -328,7 +328,7 @@ def step(self) -> None: "dist/Eagle-llama2-7b-chat-q4f16_1-MLC/Eagle-llama2-7b-chat-q4f16_1-MLC-cuda.so" ) timer = CallbackTimer() - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -385,7 +385,7 @@ def test_engine_generate(compare_precision=False): "dist/Llama-2-7b-chat-hf-q4f16_1-MLC/Llama-2-7b-chat-hf-q4f16_1-MLC-cuda.so" ) - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -403,7 +403,7 @@ def test_engine_generate(compare_precision=False): generation_config = GenerationConfig( temperature=0.0, top_p=0, max_tokens=1024, stop_token_ids=[2], 
n=1 ) - engine_single_model = SyncLLMEngine( + engine_single_model = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -446,7 +446,7 @@ def test_engine_eagle_generate(): small_model_lib_path = ( "dist/Eagle-llama2-7b-chat-q4f16_1-MLC/Eagle-llama2-7b-chat-q4f16_1-MLC-cuda.so" ) - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -494,7 +494,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): # Create engine model = "dist/Llama-2-13b-chat-hf-q4f16_1-MLC" model_lib_path = "dist/Llama-2-13b-chat-hf-q4f16_1-MLC/Llama-2-13b-chat-hf-q4f16_1-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -566,7 +566,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): # small_model_lib_path = ( # "dist/TinyLlama-1.1B-Chat-v1.0-q0f16-MLC/TinyLlama-1.1B-Chat-v1.0-q0f16-MLC-cuda.so" # ) - spec_engine = SyncLLMEngine( + spec_engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -636,7 +636,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): small_model_lib_path = ( "dist/Eagle-llama2-7b-chat-q0f16-MLC/Eagle-llama2-7b-chat-q0f16-MLC-cuda.so" ) - spec_engine = SyncLLMEngine( + spec_engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", diff --git a/tests/python/serve/test_serve_sync_engine.py b/tests/python/serve/test_serve_sync_engine.py index c5d521b02d..f68f48b7c5 100644 --- a/tests/python/serve/test_serve_sync_engine.py +++ b/tests/python/serve/test_serve_sync_engine.py @@ -5,7 +5,7 @@ import numpy as np from mlc_llm.serve import GenerationConfig, Request, RequestStreamOutput, data -from mlc_llm.serve.sync_engine import SyncLLMEngine +from mlc_llm.serve.sync_engine import SyncMLCEngine prompts = [ "What is the meaning of life?", @@ -80,7 +80,7 @@ def fcallback(delta_outputs: List[RequestStreamOutput]): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -156,7 +156,7 @@ def step(self) -> None: timer = CallbackTimer() model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -237,7 +237,7 @@ def step(self) -> None: timer = CallbackTimer() model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -323,7 +323,7 @@ def all_finished(self) -> bool: timer = CallbackTimer() model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server", @@ -365,7 +365,7 @@ def test_engine_generate(): # Create engine model = "dist/Llama-2-7b-chat-hf-q0f16-MLC" model_lib_path = "dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so" - engine = SyncLLMEngine( + engine = SyncMLCEngine( model=model, model_lib_path=model_lib_path, mode="server",
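The patch above is a pure rename (`LLMEngine` → `MLCEngine`, `AsyncLLMEngine` → `AsyncMLCEngine`, `SyncLLMEngine` → `SyncMLCEngine`); constructors and the OpenAI-style chat completion interface are untouched. For downstream scripts that must run against both pre-rename and post-rename `mlc_llm` wheels, a small import shim is one option. The sketch below is not part of the patch: it assumes the rename is purely mechanical (as the hunks above suggest), that older wheels still export the old names from `mlc_llm` (as the removed line in `python/mlc_llm/__init__.py` indicates), and it reuses the model ID from the examples above.

```python
# Hypothetical compatibility shim for code written against the old class names.
# Assumes only the names changed in this patch, not the behavior.
try:
    from mlc_llm import AsyncMLCEngine, MLCEngine  # new names introduced by this patch
except ImportError:  # pre-rename wheels still export the old names
    from mlc_llm import AsyncLLMEngine as AsyncMLCEngine
    from mlc_llm import LLMEngine as MLCEngine

# Usage is unchanged: the OpenAI-style chat completion API shown in the docs above.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=False,
)
print(response)
engine.terminate()
```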