Implementation of large language models in Java using the Vector API, including a prototype of an MCP client and an MCP server.
This implementation is an extension of llama3.java by Alfonso² Peterssen, which is based on Andrej Karpathy's llama2.c and minbpe projects. This extension adds:
- An internal HTTP server to serve OpenAI-like requests.
- An optional display of important attentions, showing which tokens the model attends to most.
- Storing of the KV-cache in a file.
Supported models are:
- DeepSeek-R1-Distill-Qwen-1.5B-Q8_0
- Llama-3 (Llama-3.2, Llama-3.3)
- Phi-3 (CLI only, no http server)
- Qwen-2.5
- Qwen3 (non-MoE)
This project has no dependencies on other libraries. If you are looking for a full-featured framework, you may have a look at LangChain for Java.
The class UiServer can be used to forward, for example, function calls from a llama.cpp web server to a custom Java implementation of the function. Custom MCP tools are provided via the McpHttpServer class; the UiServer class uses the McpHttpClient class to access these tools via the Model Context Protocol (without OAuth authentication). These classes are intended for local testing, not for production use. Tools are provided by implementing the interface org.rogmann.llmva4j.mcp.McpToolImplementations, which is looked up via ServiceLoader (see the sketch below).
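The ServiceLoader lookup follows the standard Java pattern: a provider class (the name com.example.mcp.CustomToolProvider below is hypothetical) implements the interface and is registered by listing its fully qualified name in a resource file META-INF/services/org.rogmann.llmva4j.mcp.McpToolImplementations. A minimal sketch of the discovery side:

```java
import java.util.ServiceLoader;

import org.rogmann.llmva4j.mcp.McpToolImplementations;

/** Lists all registered tool providers (illustrative snippet, not part of the project). */
public class ListToolProviders {
    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/ for registered implementations.
        ServiceLoader<McpToolImplementations> loader =
                ServiceLoader.load(McpToolImplementations.class);
        for (McpToolImplementations provider : loader) {
            System.out.println("Found tool provider: " + provider.getClass().getName());
        }
    }
}
```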
The installation of an MCP tool is straightforward, but it is always important to consider the potential consequences: What would happen if the language model made an unwanted call (due to a calculation error, unwanted training data, or an unexpected/unwanted prompt)? See the following quote from https://modelcontextprotocol.io/docs/concepts/tools for reference:
> Tools are designed to be model-controlled, meaning that tools are exposed from servers to clients with the intention of the AI model being able to automatically invoke them (with a human in the loop to grant approval).
If the display of attentions is enabled, the attentions with long value vectors are displayed. In a translation from English to Chinese, at the token "三" the model might be interested in the English word "three" (see the picture below).
Link to animation:
Options:
--model, -m <path> required, path to .gguf file
--interactive, --chat, -i run in chat mode
--instruct run in instruct (once) mode, default mode
--prompt, -p <string> input prompt
--system-prompt, -sp <string> (optional) system prompt
--temperature, -temp <float> temperature in [0,inf], default 0.1
--top-p <float> p value in top-p (nucleus) sampling in [0,1], default 0.95
--seed <long> random seed, default System.nanoTime()
--max-tokens, -n <int> number of steps to run for, < 0 = limited by context length, default 512
--stream <boolean> print tokens during generation; may cause encoding artifacts for non-ASCII text, default true
--echo <boolean> print ALL tokens to stderr; if true, it is recommended to set --stream=false, default false
--host <ip> optional IP address of the HTTP server (default 127.0.0.1)
--port <port> optional port number of the HTTP server (default 8080)
--path <path> optional path of the public-html folder of the HTTP server
--state-cache-folder <path> optional folder to store state caches (writes .ggsc files), see the example after this list
--state-cache, -sc <path> optional state cache to be used (reads a .ggsc file)
--attention-trace <int> maximum number of attentions to be traced per token
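For example (folder and file names are illustrative), a chat session started with --state-cache-folder writes .ggsc state files, which a later run can reuse via --state-cache to avoid recomputing the cached state:

java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -i --state-cache-folder state-caches

java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -i -sc state-caches/session.ggsc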
Start an HTTP server serving DeepSeek-R1-Distill-Qwen-1.5B using the web UI of llama.cpp:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf -i --path .../git/llama.cpp/examples/server/public -n 2000
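Once the server is running, it can be queried programmatically. A minimal sketch, assuming the server accepts OpenAI-style chat requests at /v1/chat/completions on the default host and port (the endpoint path and payload fields are assumptions based on the OpenAI API format):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sends a single chat-completion request to the local server (illustrative snippet). */
public class ChatRequestExample {
    public static void main(String[] args) throws Exception {
        String body = """
                {"messages": [{"role": "user", "content": "Why is the sky blue?"}],
                 "stream": false}""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8080/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // Blocking send; the response body contains the model's JSON reply.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```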
Start an HTTP server serving Qwen2.5 using the internal web UI to display attentions:
java --enable-preview --add-modules jdk.incubator.vector -cp target/llmvectorapi4j-0.1.0-SNAPSHOT.jar org.rogmann.llmva4j.Qwen2 -m .../qwen2.5-1.5b-instruct-q8_0.gguf -sp "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -i --attention-trace 3 --path src/main/webapp