Bridge GitHub Copilot with local vLLM/TGI servers and HuggingFace cloud models
- ⬇️ Download v1.0.0 VSIX (92KB)
- Install in VS Code:
  - Press `Ctrl+Shift+P` (or `Cmd+Shift+P` on macOS)
  - Type `Extensions: Install from VSIX...`
  - Select the downloaded `.vsix` file
- Restart VS Code and select models in GitHub Copilot Chat
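Whichever route you take (the manual steps above or the one-liner below), you can confirm the extension registered with VS Code from the command line. A quick check; the extension ID is assumed from the VSIX filename, so adjust the grep pattern if yours differs:

```bash
# List installed extensions and filter for the bridge (ID pattern assumed)
code --list-extensions | grep -i "vllm-huggingface-bridge"
```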
# Download and install in one command
wget https://github.com/dzivkovi/vllm-huggingface-bridge/releases/download/v1.0.0/vllm-huggingface-bridge-1.0.0.vsix
code --install-extension vllm-huggingface-bridge-1.0.0.vsix

# Remove original HuggingFace extension if installed
code --uninstall-extension HuggingFace.huggingface-vscode-chat
# Remove old vLLM Community version if installed
code --uninstall-extension vllm-community.vllm-huggingface-bridge

- 🔒 Air-Gapped Ready: Complete offline operation with local vLLM/TGI servers
- 🔄 Dual Mode: Seamlessly switch between local and cloud models
- ⚡ Optimized: 92KB package size (91% smaller than original)
- 🛡️ Enterprise Ready: Production-tested in secure environments
- 🔧 Zero Config: Works out-of-the-box with sensible defaults
- 📊 Smart Token Management: Automatic allocation for small context models
For secure, on-premise environments where data cannot leave your network:
- Start your local vLLM or TGI server (see setup instructions below)
- Configure VS Code settings: `"huggingface.localEndpoint": "http://your-server:8000"`
- Select your local model from the GitHub Copilot Chat model picker
- No API keys required; all processing stays on your infrastructure (a quick endpoint check is sketched below)
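Before selecting the model in VS Code, it can help to confirm the server speaks the OpenAI-compatible API the extension talks to. A minimal sketch against a vLLM server; the host and model name are placeholders, so substitute your own (recent TGI versions also expose `/v1/chat/completions`, but check your server's docs):

```bash
# One-off chat completion to confirm end-to-end inference (host and model are examples)
curl http://your-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/deepseek-coder-6.7B-instruct-AWQ",
        "messages": [{"role": "user", "content": "Write hello world in Python."}],
        "max_tokens": 64
      }'
```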
- Install the extension: 📦 Download VSIX
- Open VS Code's chat interface.
- Click the model picker and click "Manage Models...".
- Select "Hugging Face" provider.
- Provide your Hugging Face token; you can get one from your settings page. It only needs the `inference.serverless` permission (a quick token check from the terminal is sketched below).
- Choose the models you want to add to the model picker. 🥳
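If you want to sanity-check the token outside of VS Code first, the Inference Providers router exposes an OpenAI-compatible endpoint. A minimal sketch; the router URL and the model ID are assumptions taken from the Inference Providers docs linked under Resources, so verify them there:

```bash
# Hypothetical smoke test of a Hugging Face token against the Inference Providers router
export HF_TOKEN="hf_xxx"   # token with the inference.serverless permission

curl https://router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```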
Production Ready: Successfully deployed in enterprise air-gapped environments.
- Data Security: All data remains on your infrastructure
- Air-Gapped Operation: No internet connectivity required
- Low Latency: Direct connection to local GPU servers
- Cost Control: No per-token API charges
- Compliance: Meet strict data residency requirements
# Start vLLM (tested with RTX 4060, 8GB VRAM, DeepSeek-Coder 6.7B)
docker run -d --name vllm-server \
--gpus all \
--shm-size=4g \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model TheBloke/deepseek-coder-6.7B-instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.85

// .vscode/settings.json
{
"huggingface.localEndpoint": "http://localhost:8000",
// CRITICAL for small context models (2048 tokens):
"github.copilot.chat.editor.temporalContext.enabled": false,
"github.copilot.chat.edits.temporalContext.enabled": false,
"github.copilot.chat.edits.suggestRelatedFilesFromGitHistory": false
}

- 2048-token context models ARE usable with the settings above
- vLLM adds ~500 tokens for chat template formatting
- Extension automatically adjusts token allocation
- Responses are limited to 50-100 tokens when near the context limit
- For best experience: Use 8K+ context models
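To see how much context your local server actually advertises, you can inspect the model listing; vLLM includes a `max_model_len` field in its `/v1/models` response (an assumption about vLLM's response shape worth verifying; other servers may omit it):

```bash
# Print each served model's ID and the context window vLLM reports for it
curl -s http://localhost:8000/v1/models | jq '.data[] | {id: .id, max_model_len: .max_model_len}'
```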
- Access SoTA open-source LLMs with tool calling capabilities.
- Single API to switch between multiple providers: Cerebras, Cohere, Fireworks AI, Groq, HF Inference, Hyperbolic, Nebius, Novita, Nscale, SambaNova, Together AI, and more. See the full list of partners in the Inference Providers docs.
- Built for high availability (across providers) and low latency.
- Local Inference Support: Run vLLM or TGI servers on-premise for air-gapped deployments
- Transparent pricing: what the provider charges is what you pay.
💡 The free Hugging Face user tier gives you a small amount of monthly inference credits to experiment. Upgrade to Hugging Face PRO or Enterprise for $2 in monthly credits plus pay-as-you-go access across all providers!
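As a sketch of what "single API" means in practice: with the OpenAI-compatible router, switching providers is just a change to the model string. The `:provider` suffix below reflects how provider selection is described in the Inference Providers docs, but treat both the suffix syntax and the model ID as assumptions and confirm them against those docs:

```bash
# Same endpoint and payload shape; only the model string pins a specific provider (assumed syntax)
curl https://router.huggingface.co/v1/chat/completions \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct:groq",
        "messages": [{"role": "user", "content": "Hello from a pinned provider."}]
      }'
```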
- VS Code 1.104.0 or higher.
- Hugging Face access token with `inference.serverless` permissions.
git clone https://github.com/huggingface/huggingface-vscode-chat
cd huggingface-vscode-chat
npm install
npm run compile

Press F5 to launch an Extension Development Host.
Common scripts:
- Build: `npm run compile`
- Watch: `npm run watch`
- Lint: `npm run lint`
- Format: `npm run format`
- Quick rebuild: `scripts/rebuild-extension.sh`
- Test vLLM: `scripts/test-vllm.sh`
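To turn a local build into an installable package, the standard VS Code tooling applies; a sketch assuming no extra packaging steps beyond what `scripts/rebuild-extension.sh` already does:

```bash
# Compile, then package the extension into a .vsix with the official vsce tool
npm run compile
npx @vscode/vsce package

# Smoke-test the result in your local VS Code (adjust the filename to the generated .vsix)
code --install-extension ./*.vsix
```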
📚 For detailed guides, see our comprehensive documentation
This extension supports connecting to your own local inference servers for private model hosting.
docker run -d --name vllm-server \
--gpus all \
--shm-size=4g \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model TheBloke/deepseek-coder-6.7B-instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.85 \
--max-model-len 2048 \
--max-num-seqs 16 \
--disable-log-stats

- `--shm-size=4g` - Without this, vLLM crashes
- `--ipc=host` - Without this, GPU communication fails
- `--max-model-len 2048` - Without this, runs out of memory
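The container needs some time to pull weights and load the model before the endpoint answers, so it is worth waiting for the API before moving on to the VS Code settings below. A small readiness check (host and port assume the mapping above):

```bash
# Optional: watch the model load in another terminal
#   docker logs -f vllm-server

# Poll the OpenAI-compatible endpoint until the server answers
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "waiting for vLLM to finish loading..."
  sleep 5
done
echo "vLLM is ready"
```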
- Open Settings (Ctrl+,)
- Search for "huggingface.localEndpoint"
- Set value: `http://localhost:8000`
- Reload VS Code
- Start: Click ▶️ on container in Docker Desktop
- Stop: Click ⏹️ on container in Docker Desktop
- Logs: Click container name to view logs
- Remove: Stop first, then click 🗑️
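The same lifecycle operations are available from the terminal if you prefer the Docker CLI over Docker Desktop:

```bash
docker start vllm-server     # Start the container
docker stop vllm-server      # Stop it
docker logs -f vllm-server   # Follow the logs
docker rm vllm-server        # Remove it (stop it first)
```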
curl http://localhost:8000/v1/models
# Should return: TheBloke/deepseek-coder-6.7B-instruct-AWQ

Full Setup Guide: vLLM Setup Guide
Model Selection: Choose models for your GPU
- Open VS Code Settings (Ctrl+,)
- Search for "huggingface.localEndpoint"
- Enter your TGI server URL (e.g., `http://192.168.1.100:8080`)
- See TGI Setup Guide for legacy support (a minimal TGI container sketch follows below)
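For reference, a TGI server can be started in much the same way as the vLLM container above. This is a rough sketch based on TGI's standard Docker quickstart; the model ID and resource flags are examples, so consult the TGI Setup Guide and the TGI documentation linked below for values that fit your hardware:

```bash
# Run Text Generation Inference and expose its API on port 8080
docker run -d --name tgi-server \
  --gpus all \
  --shm-size=1g \
  -p 8080:80 \
  -v $HOME/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id HuggingFaceH4/zephyr-7b-beta
```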
- Inference Providers documentation: https://huggingface.co/docs/inference-providers/index
- VS Code Chat Provider API: https://code.visualstudio.com/api/extension-guides/ai/language-model-chat-provider
- TGI Documentation: https://huggingface.co/docs/text-generation-inference
- Open issues: https://github.com/huggingface/huggingface-vscode-chat/issues
- License: MIT License, Copyright (c) 2025 Hugging Face
