
Serving a model from Hugging Face with FastAPI

  • Python 3.11
  • FastAPI
  • Transformers
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model from Hugging Face

Note: I haven't implemented batching yet. I will learn about it and add it soon.

Steps

  1. Clone the repository

  2. Install uv

  3. Install dependencies

    uv sync --frozen
  4. You will need your Hugging Face token the first time you run the server, so it can download the model

  5. Run the development server on port 8080

    make dev
  6. Send a request to the /completion endpoint for inference

    curl -X POST -H "Content-Type: application/json" -d '{"query":"hello"}' http://127.0.0.1:8080/completion

    This endpoint streams tokens using server-sent events (SSE). Once the LLM is done, it sends a [DONE] event, followed by the inference stats. (A sketch of such an endpoint follows the example below.)

    Example:

    data: isolate
    
    data: y.
    
    data: [DONE]
    
    data: {"start_time": "2025-01-24T19:09:39.295529", "end_time": "2025-01-24T19:09:51.029058", "elapsed_time": 11.733529, "num_tokens": 119, "tokens_per_second": 10.141876327232838}
    

Run the server on port 8080

make start

Docker

You will need to put your Hugging Face token in the hf_token variable in the Makefile.

Build the Docker image

make docker-build

Run the Docker image

make docker-run
