- Python 3.11
- FastAPI
- Transformers
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model from Hugging Face
Note: I haven't implemented batching yet. I will learn about it and add it soon.
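For orientation, a server wired from this stack might look roughly like the sketch below. This is a minimal illustration, not this repo's actual code: the structure, generation parameters, and stats fields are assumptions based on the endpoint behavior described later in this README.

```python
import json
import threading
from datetime import datetime

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

app = FastAPI()


class CompletionRequest(BaseModel):
    query: str


@app.post("/completion")
def completion(req: CompletionRequest):
    def event_stream():
        start = datetime.now()
        inputs = tokenizer(req.query, return_tensors="pt")
        # TextIteratorStreamer yields decoded text chunks while generate()
        # runs in a background thread.
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        thread = threading.Thread(
            target=model.generate,
            kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
        )
        thread.start()

        num_chunks = 0
        for text in streamer:
            num_chunks += 1
            yield f"data: {text}\n\n"
        thread.join()

        end = datetime.now()
        elapsed = (end - start).total_seconds()
        yield "data: [DONE]\n\n"
        # Inference stats sent after [DONE], mirroring the example output below.
        stats = {
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "elapsed_time": elapsed,
            "num_tokens": num_chunks,  # approximation: one chunk per token
            "tokens_per_second": num_chunks / elapsed if elapsed else 0.0,
        }
        yield f"data: {json.dumps(stats)}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```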
- Clone the repository
- Install uv
- Install dependencies:

  ```
  uv sync --frozen
  ```
- You will need your Hugging Face token once, to download the model the first time you run the server
- Run the development server on port 8080:

  ```
  make dev
  ```
- Send a request to the `/completion` endpoint for inference:

  ```
  curl -X POST -H "Content-Type: application/json" -d '{"query":"hello"}' http://127.0.0.1:8080/completion
  ```
This endpoint streams tokens using server-sent events (SSE). After the LLM is done, it sends a `[DONE]` event, followed by the inference stats. Example:

```
data: isolate
data: y.
data: [DONE]
data: {"start_time": "2025-01-24T19:09:39.295529", "end_time": "2025-01-24T19:09:51.029058", "elapsed_time": 11.733529, "num_tokens": 119, "tokens_per_second": 10.141876327232838}
```
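To consume the stream programmatically rather than with curl, something like the sketch below works (this uses the `requests` library and assumes the exact event layout shown above):

```python
import json

import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"query": "hello"},
    stream=True,
)

done = False
for line in resp.iter_lines(decode_unicode=True):
    if not line:
        continue  # SSE events are separated by blank lines
    payload = line.removeprefix("data: ")
    if payload == "[DONE]":
        done = True
    elif done:
        # The event after [DONE] carries the inference stats as JSON.
        stats = json.loads(payload)
        print(f"\n{stats['tokens_per_second']:.1f} tokens/s")
    else:
        print(payload, end="", flush=True)
```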
Run the server:

```
make start
```

You will need to put your Hugging Face token in the `hf_token` variable in the Makefile.
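One common way to wire this up is to have the Makefile export the token as an environment variable and have the app pass it to `from_pretrained`. The sketch below illustrates that pattern; the `HF_TOKEN` variable name and the plumbing are assumptions, not necessarily what this repo does:

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical plumbing: the Makefile exports hf_token as HF_TOKEN,
# and the app forwards it so the model download can authenticate.
hf_token = os.environ.get("HF_TOKEN")

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=hf_token)
```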
Build and run with Docker:

```
make docker-build
make docker-run
```