This notebook demonstrates how to build a simple function-calling AI agent from scratch using the OpenAI-compatible API format and a local vLLM inference server. It shows how to structure prompts, parse model outputs, and route function calls.
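As a preview of the core pattern, here is a minimal sketch of a single agent turn against an OpenAI-compatible endpoint. The base URL, the model name, and the `get_weather` tool are illustrative assumptions, not fixed by the notebook; the full agent built later may structure these differently.

```python
import json
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (assumed default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A single illustrative tool schema; the notebook may define different functions.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"It is sunny in {city}."  # stub implementation for the sketch

FUNCTIONS = {"get_weather": get_weather}  # routing table: tool name -> callable

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(
    model="Salesforce/xLAM-2-3b-fc-r",  # assumes the launch command below
    messages=messages,
    tools=tools,
)
msg = response.choices[0].message

# If the model requested a tool call, parse its JSON arguments and route it.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = FUNCTIONS[call.function.name](**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
    # Send the tool result back so the model can produce a final answer.
    final = client.chat.completions.create(
        model="Salesforce/xLAM-2-3b-fc-r",
        messages=messages,
    )
    print(final.choices[0].message.content)
else:
    print(msg.content)
```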
NOTE: Before running the notebook, launch a vLLM server in a separate terminal using your model of choice:
vllm serve <model>
I am using Salesforce/xLAM-2-3b-fc-r because of its small size and its high ranking on the Berkeley Function-Calling Leaderboard. The model runs locally on my NVIDIA RTX 3060.
vLLM launch command:
vllm serve Salesforce/xLAM-2-3b-fc-r --enable-auto-tool-choice --tool-parser-plugin ./xlam_tool_call_parser.py --tool-call-parser xlam --tensor-parallel-size 1 --dtype float16 --gpu-memory-utilization 0.8
The full vLLM launch instructions for this particular model can be found in the Using vLLM for Inference section of the model's Hugging Face page.
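Once the server is up, a quick way to confirm it is reachable before running the notebook is to list the served models. This sketch assumes vLLM's default port 8000; adjust the base URL if you changed it.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API at /v1 on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])
# Expected to include "Salesforce/xLAM-2-3b-fc-r" if the launch command above succeeded.
```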