feat: vision/image support via stream-json + CLI isolation#14

Open
rjabalosiii wants to merge 1 commit into atalovesyou:main from rjabalosiii:feat/vision-multimodal-support

Conversation


Summary

Adds multimodal (text + image) support for OpenAI-compatible clients that send base64 screenshots, such as Browser Use. Also fixes several compatibility issues discovered during real-world testing.

What changed

  • Vision/image support: Auto-detects image_url content parts in OpenAI requests and switches to --input-format stream-json mode, piping NDJSON with base64 image blocks via stdin. Text-only requests still use the fast CLI argument path (zero overhead when no images).
  • Multimodal content arrays: Handles OpenAI's content field as both string and Array<{type, text?, image_url?}> — required for any client sending screenshots or structured content.
  • Code fence stripping: Claude wraps JSON responses in markdown code fences (```json … ```). Added `stripCodeFences()` to extract clean JSON for clients expecting raw JSON (e.g., Browser Use's structured output).
  • CLI isolation: Prevents Claude Code's agentic system prompt, tools, skills, and settings files from leaking into proxy responses. Flags: --tools "", --disable-slash-commands, --setting-sources "", custom --system-prompt, cwd: /tmp.
  • Body limit bump: 10mb → 50mb to accommodate base64-encoded screenshots.
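
The fence-stripping helper described above can be sketched as follows. The name `stripCodeFences` comes from the PR; the regex-based implementation here is an illustrative assumption, not the PR's actual code:

```typescript
// Sketch: if the whole response is a single fenced block
// (```json ... ```), return the inner content; otherwise return
// the text unchanged. The PR's real helper may handle more cases.
function stripCodeFences(text: string): string {
  const match = text.trim().match(/^```[\w-]*\r?\n([\s\S]*?)\r?\n?```$/);
  return match ? match[1] : text;
}

// '```json\n{"ok":true}\n```'  →  '{"ok":true}'
// 'plain text'                 →  'plain text' (unchanged)
```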

How it works

Client sends image_url (base64 data URI)
  → Proxy detects hasImages=true
  → Converts to Claude CLI stream-json format:
    {"type":"user","message":{"role":"user","content":[
      {"type":"text","text":"..."},
      {"type":"image","source":{"type":"base64","media_type":"image/png","data":"..."}}
    ]}}
  → Pipes via stdin with --input-format stream-json
  → Claude CLI processes multimodal input
  → Response streamed back as SSE chunks
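
The conversion step in the flow above can be sketched like this. The type and function names are illustrative (the proxy's real types live in `src/types/openai.ts` and may differ); only the output shape matches the stream-json line shown above:

```typescript
// Turn one OpenAI-style content array (text / image_url parts) into
// the stream-json line piped to the Claude CLI via stdin.
type OpenAIPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

function toStreamJson(parts: OpenAIPart[]): string {
  const content = parts.map((p) => {
    if (p.type === "text") return { type: "text", text: p.text };
    // Split "data:image/png;base64,AAAA..." into media type and payload
    const m = p.image_url.url.match(/^data:(image\/\w+);base64,(.+)$/s);
    if (!m) throw new Error("expected a base64 image data URI");
    return {
      type: "image",
      source: { type: "base64", media_type: m[1], data: m[2] },
    };
  });
  return JSON.stringify({ type: "user", message: { role: "user", content } });
}
```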

Backward compatibility

  • No breaking changes — text-only requests follow the same code path as before
  • Stream-json mode only activates when images are detected
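
The gating described above amounts to a single detection check, sketched here with a hypothetical message shape (the proxy's actual types may differ):

```typescript
// Only switch to --input-format stream-json when some message
// carries an image_url content part; plain-string content (the
// text-only path) never triggers it.
type Message = { role: string; content: string | Array<{ type: string }> };

function hasImages(messages: Message[]): boolean {
  return messages.some(
    (m) =>
      Array.isArray(m.content) &&
      m.content.some((part) => part.type === "image_url")
  );
}
```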

Test plan

  • E2E tested with Browser Use v0.11.9 (vision-enabled browser automation)
  • Stress tested with 3 task types: data extraction, multi-page navigation, structured output
  • Verified text-only requests still work via CLI argument path
  • Verified base64 PNG screenshots pass through to Claude and get vision responses

🤖 Generated with Claude Code

Adds multimodal (text + image) support for OpenAI-compatible clients
like Browser Use that send base64 screenshots via the chat completions
endpoint.

Changes:
- Support OpenAI content arrays with text and image_url parts
- Auto-detect images and switch to --input-format stream-json mode
  (text-only requests still use the fast CLI argument path)
- Convert data URI images to Claude CLI base64 format via stdin piping
- Strip code fences from model responses (Claude wraps JSON in fences)
- Isolate CLI subprocess: --tools "", --disable-slash-commands,
  --setting-sources "", --system-prompt override, cwd /tmp
- Bump body limit to 50mb for base64 image payloads

Tested with Browser Use v0.11.9 running vision-enabled browser
automation tasks (screenshots piped as base64 PNG).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bwiedmann added a commit to bwiedmann/claude-max-api-proxy that referenced this pull request Feb 15, 2026
…am-json + CLI isolation

# Conflicts:
#	src/adapter/cli-to-openai.ts
#	src/adapter/openai-to-cli.ts
#	src/server/routes.ts
#	src/subprocess/manager.ts
#	src/types/openai.ts
