Implementation of large language models in Java using the Vector API, including a prototype of an MCP client and an MCP server.
This implementation is an extension of llama3.java by Alfonso² Peterssen, which is based on Andrej Karpathy's llama2.c and minbpe projects. This extension adds:
- An internal HTTP server to serve OpenAI-like requests.
- An optional display of important attentions, showing which tokens the model attends to most.
- Storing of the KV-cache in a file.
Supported models are:
- DeepSeek-R1-Distill-Qwen-1.5B-Q8_0
- Llama-3 (Llama-3.2, Llama-3.3)
- Phi-3 (CLI only, no http server)
- Qwen-2.5
- Qwen3 (non-MoE)
This project has no dependencies on other libraries. If you are looking for a full-featured framework, you may have a look at LangChain for Java.
The class UiServer can be used to forward, for example, function calls from a llama.cpp web server to a custom Java implementation of the function. Custom MCP tools are provided via the McpHttpServer class; the UiServer class uses the McpHttpClient class to access these tools via the Model Context Protocol (without OAuth authentication). These classes are intended for local testing, not for production use. Tools are provided by implementing the interface org.rogmann.llmva4j.mcp.McpToolImplementations, which is looked up via ServiceLoader (see the sketch below).
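The ServiceLoader lookup follows the standard Java pattern: a provider class (the name com.example.mcp.CustomToolProvider below is hypothetical) implements the interface and is registered by listing its fully qualified name in a resource file META-INF/services/org.rogmann.llmva4j.mcp.McpToolImplementations. A minimal sketch of the discovery side:

```java
import java.util.ServiceLoader;

import org.rogmann.llmva4j.mcp.McpToolImplementations;

/** Lists all registered tool providers (illustrative snippet, not part of the project). */
public class ListToolProviders {
    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/ for registered implementations.
        ServiceLoader<McpToolImplementations> loader =
                ServiceLoader.load(McpToolImplementations.class);
        for (McpToolImplementations provider : loader) {
            System.out.println("Found tool provider: " + provider.getClass().getName());
        }
    }
}
```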
The installation of an MCP tool is straightforward, but it is always important to consider the potential consequences: What would happen if the language model made an unwanted call (due to a calculation error, unwanted training data, or an unexpected/unwanted prompt)? See the following quote from https://modelcontextprotocol.io/docs/concepts/tools for reference:
> Tools are designed to be model-controlled, meaning that tools are exposed from servers to clients with the intention of the AI model being able to automatically invoke them (with a human in the loop to grant approval).
If the display of attentions is enabled, the attentions with long value vectors are displayed. In a translation from English to Chinese, at the token "三" the model might be interested in the English word "three" (see the picture below).
Link to animation:
Options:
--model, -m <path> required, path to .gguf file
--interactive, --chat, -i run in chat mode
--instruct run in instruct (once) mode, default mode
--prompt, -p <string> input prompt
--system-prompt, -sp <string> (optional) system prompt
--temperature, -temp <float> temperature in [0,inf], default 0.1
--top-p <float> p value in top-p (nucleus) sampling in [0,1], default 0.95
--seed <long> random seed, default System.nanoTime()
--max-tokens, -n <int> number of steps to run for, < 0 = limited by context length, default 512
--stream <boolean> print tokens during generation; may cause encoding artifacts for non-ASCII text, default true
--echo <boolean> print ALL tokens to stderr; if true, it is recommended to set --stream=false, default false
--host <ip> optional IP address of the HTTP server (default 127.0.0.1)
--port <port> optional port number of the HTTP server (default 8080)
--path <path> optional path of the public-html folder of the HTTP server
--state-cache-folder <path> optional folder to store state caches (writes .ggsc files), see the example after this list
--state-cache, -sc <path> optional state cache to be used (reads a .ggsc file)
--attention-trace <int> maximum number of attentions to be traced per token
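For example (folder and file names are illustrative), a chat session started with --state-cache-folder writes .ggsc state files, which a later run can reuse via --state-cache to avoid recomputing the cached state:

java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -i --state-cache-folder state-caches

java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -i -sc state-caches/session.ggsc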
Start an HTTP server serving DeepSeek-R1-Distill-Qwen-1.5B using the web UI of llama.cpp:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -i --path .../git/llama.cpp/examples/server/public -n 2000
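Once the server is running, it can be queried programmatically. A minimal sketch, assuming the server accepts OpenAI-style chat requests at /v1/chat/completions on the default host and port (the endpoint path and payload fields are assumptions based on the OpenAI API format):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sends a single chat-completion request to the local server (illustrative snippet). */
public class ChatRequestExample {
    public static void main(String[] args) throws Exception {
        String body = """
                {"messages": [{"role": "user", "content": "Why is the sky blue?"}],
                 "stream": false}""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8080/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // Blocking send; the response body contains the model's JSON reply.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```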
Start an HTTP server serving Qwen2.5 using the internal web UI to display attentions:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -sp "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -i --attention-trace 3 --path src/main/webapp