
A Chat web client for chatting with open-source LLMs deployed behind a vLLM inference server


brainqub3/brainqub3_chat


Brainqub3 Chat

Brainqub3 Chat is a local-first Next.js workspace that wraps a Brainqub3-themed chat UI, a vLLM OpenAI-compatible orchestrator, and an MCP bridge so models can call local tools. The UI, API routes, and MCP helpers all run on your machine; only the vLLM endpoint needs to be reachable over HTTP, which can be local or a remote RunPod deployment.

Features

  • Brainqub3 UI: Always-dark layout with cyan/purple glow accents, Geist typography, streaming messages, a caret animation, keyboard shortcuts (⌘/Ctrl+K for a new chat, ⌘/Ctrl+Enter to send), and editing of the last prompt.
  • Session management: Multi-session rail with previews, rename-on-first-answer, delete, and automatic persistence to localStorage.
  • vLLM orchestrator: /api/chat forwards OpenAI-style chat completion requests (with tool_choice:"auto") to a configurable VLLM_BASE_URL, loops through tool calls, and streams Server-Sent Events back to the browser.
  • MCP bridge (experimental): /api/mcp/* endpoints manage stdio or HTTP MCP servers via the official TypeScript SDK, exposing each MCP tool as an OpenAI tool (mcp:<serverId>:<toolName>). This pathway is currently untested end-to-end, so expect to troubleshoot transports if you enable it.
  • Local controls: Model picker, per-session system prompt pill, estimated token budget bar, and live MCP server status cards.
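The streaming path works by relaying Server-Sent Events from `/api/chat` to the browser: each event carries a `data:` payload, and a literal `[DONE]` marks the end of the stream. As a rough, stand-alone sketch of the wire format (the app itself uses eventsource-parser; this simplified line-level extractor is illustrative only):

```typescript
// Minimal SSE payload extractor (illustrative; the app uses
// eventsource-parser). Splits a raw chunk into complete events and
// returns the `data:` payloads; "[DONE]" marks end-of-stream.
function extractSseData(chunk: string): string[] {
  const payloads: string[] = [];
  for (const event of chunk.split("\n\n")) {
    for (const line of event.split("\n")) {
      if (line.startsWith("data:")) {
        const data = line.slice("data:".length).trim();
        if (data && data !== "[DONE]") payloads.push(data);
      }
    }
  }
  return payloads;
}
```

A real client would additionally buffer partial chunks across reads, which is exactly what eventsource-parser handles.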

Local setup

Prerequisites

  1. Node.js 18+ – required for the App Router and Node runtime routes.
  2. vLLM endpoint – any OpenAI-compatible vLLM server. You can:
    • Run it locally (see Running vLLM locally below), or
    • Deploy your own container on RunPod. Start from the RunPod vLLM template, then follow the official RunPod docs to finish configuring the worker and expose the HTTPS endpoint you will paste into VLLM_BASE_URL.
  3. Optional: any MCP servers (stdio binaries or HTTP endpoints) you want the model to call.

1. Install dependencies

npm install

2. Configure environment

Create .env.local in the project root:

VLLM_BASE_URL=http://localhost:8000              # or the HTTPS URL from your RunPod deployment
DEFAULT_MODEL=moonshotai/Kimi-K2-Thinking        # server-side default passed to vLLM
NEXT_PUBLIC_DEFAULT_MODEL=moonshotai/Kimi-K2-Thinking

The defaults ship with the Kimi K2 model; change both variables if you point at a different checkpoint. DEFAULT_MODEL drives the API’s fallback choice, while NEXT_PUBLIC_DEFAULT_MODEL seeds new sessions on the client.
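The split between the two variables can be pictured with a hypothetical helper (not the app's actual code): the API route honours an explicit model from the request first, then falls back to `DEFAULT_MODEL`.

```typescript
// Illustrative fallback chain (hypothetical helper, not the app's code):
// request body model -> server-side DEFAULT_MODEL -> hard-coded default.
function resolveModel(requested?: string): string {
  return requested?.trim() || process.env.DEFAULT_MODEL || "moonshotai/Kimi-K2-Thinking";
}
```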

3. Run the services locally

Running vLLM locally

python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2-Thinking \
  --host 0.0.0.0 --port 8000

If you prefer an on-demand GPU endpoint, deploy the RunPod template linked above, then set VLLM_BASE_URL to the provided HTTPS endpoint. The Next.js app treats local and remote URLs the same.
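Before starting the UI, you can sanity-check that the endpoint you put in VLLM_BASE_URL is reachable. vLLM's OpenAI-compatible server exposes `GET /v1/models`; a 200 response means the base URL works for both local and RunPod deployments. A stand-alone sketch (not part of the app):

```typescript
// Quick connectivity check against an OpenAI-compatible vLLM endpoint.
// vLLM serves GET /v1/models; a 200 with a model list means the base
// URL is usable. (Stand-alone sketch, not part of the app.)
function modelsUrl(base: string): string {
  // Tolerate a trailing slash in the configured base URL.
  return `${base.replace(/\/+$/, "")}/v1/models`;
}

async function checkVllm(base: string): Promise<boolean> {
  try {
    const res = await fetch(modelsUrl(base));
    return res.ok;
  } catch {
    return false; // endpoint unreachable
  }
}
```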

Start the Next.js workspace

npm run dev

Visit http://localhost:3000, create a chat, and start messaging. Expand MCP Servers to register stdio or HTTP transports; enabled servers automatically expose their tools to the model and show up under the “+Tools” indicator.
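Tools registered this way reach the model under the `mcp:<serverId>:<toolName>` naming scheme described above. When a tool call comes back, the name has to be split on the first two colons only, since tool names may themselves contain colons. An illustrative parser (field names are assumptions, not the app's actual types):

```typescript
// Parse the "mcp:<serverId>:<toolName>" scheme from the README.
// Splits on the first two colons only, so tool names containing
// colons survive intact. (Illustrative helper.)
function parseMcpToolName(name: string): { serverId: string; toolName: string } | null {
  if (!name.startsWith("mcp:")) return null; // not an MCP-bridged tool
  const rest = name.slice("mcp:".length);
  const sep = rest.indexOf(":");
  if (sep < 0) return null; // malformed: no tool name present
  return { serverId: rest.slice(0, sep), toolName: rest.slice(sep + 1) };
}
```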

Development notes

  • API routes opt into the Node runtime so the MCP stdio transports can spawn child processes via child_process.
  • Streaming uses Server-Sent Events and eventsource-parser to buffer tool-call metadata while emitting text deltas immediately.
  • The MCP registry lives in memory; restarting npm run dev clears MCP state, but chat sessions remain in the browser thanks to localStorage.
  • Tailwind centralizes Brainqub3 design tokens (glows, gradients, type scale) so the sidebar, chat pane, and tool cards stay consistent.
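The tool-call buffering mentioned above is needed because OpenAI-style streams deliver tool calls as indexed fragments: ids and names appear once, while the `function.arguments` JSON arrives in pieces that must be concatenated before the call can be executed. A simplified merger (the types are assumptions based on the OpenAI streaming format, not the app's actual code):

```typescript
// Simplified merge of OpenAI-style streamed tool-call deltas: fragments
// arrive keyed by `index`; ids/names appear once, argument JSON arrives
// in pieces to concatenate. (Types are illustrative.)
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface MergedToolCall {
  id: string;
  name: string;
  arguments: string;
}

function mergeToolCallDeltas(deltas: ToolCallDelta[]): MergedToolCall[] {
  const calls = new Map<number, MergedToolCall>();
  for (const d of deltas) {
    const call = calls.get(d.index) ?? { id: "", name: "", arguments: "" };
    if (d.id) call.id = d.id;
    if (d.function?.name) call.name += d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
    calls.set(d.index, call);
  }
  return [...calls.values()];
}
```

Only once a call is fully assembled can its arguments be JSON-parsed and dispatched to the matching MCP tool.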

NPM scripts

  • npm run dev – Next.js dev server
  • npm run build – production build
  • npm run start – serve the production build
  • npm run lint – ESLint

Possible enhancements

Ideas for later: persist sessions to disk, add an MCP prompt/resource library, or integrate token-aware summarization when transcripts approach the context window.
