brainqub3/inference_engine_example

Inference Engine

Streamlit UI for redacting personally identifiable information (PII) from emails by proxying requests to a remote vLLM server. The frontend sends each email to the model with a built‑in system prompt that replaces sensitive spans with [redacted] while keeping all other text verbatim.
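The request the frontend sends can be sketched as an OpenAI-compatible chat-completions payload. Note the system prompt below is illustrative only; the repo ships its own built-in version:

```python
import json

# Illustrative system prompt -- the repo's actual built-in prompt may differ.
SYSTEM_PROMPT = (
    "Replace every personally identifiable span (names, emails, phone "
    "numbers, addresses) with [redacted]. Keep all other text verbatim."
)

def build_redaction_payload(email_body: str, model_id: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model_id,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": email_body},
        ],
        "temperature": 0.0,  # deterministic output suits redaction
    }

payload = build_redaction_payload(
    "Hi, I'm Jane (jane@example.com).", "ibm-granite/granite-4.0-h-1b"
)
print(json.dumps(payload, indent=2))
```

The payload is POSTed to {VLLM_BASE_URL}/chat/completions with the OPENAI_API_KEY as a bearer token, which is all an OpenAI-compatible vLLM server needs.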

Prerequisites

  • Anaconda or Miniconda
  • Python 3.11 (installed via Conda)
  • (Optional) NVIDIA GPU + CUDA drivers if you want to host your own vLLM inference server
  • Hugging Face access token with permission to download the target model

Local Setup (Anaconda)

# 1. Clone the repo and enter it
git clone <your fork or repo URL>
cd inference_engine_example   # or whatever folder name your clone landed in

# 2. Create/activate the Conda env (only once)
conda create -n private-inference python=3.11 -y
conda activate private-inference

# 3. Install app dependencies
pip install -r requirements.txt

# 4. Configure environment variables
copy .env.example .env   # Windows; use `cp` on macOS/Linux. Edit the values to point at your vLLM endpoint + key

# 5. Run the Streamlit UI
streamlit run app/streamlit_app.py --logger.level info

The Makefile mirrors those steps if you prefer make env, make deps, and make chat. (Make sure conda run finds the llm-chat environment, or override the default with ENV_NAME=private-inference make chat.)
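A minimal sketch of what those targets might look like (the actual Makefile may differ; llm-chat appears to be its default environment name):

```make
# Sketch only -- consult the repo's actual Makefile.
ENV_NAME ?= llm-chat

env:
	conda create -n $(ENV_NAME) python=3.11 -y

deps:
	conda run -n $(ENV_NAME) pip install -r requirements.txt

chat:
	conda run -n $(ENV_NAME) streamlit run app/streamlit_app.py --logger.level info
```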

Configuration

  • VLLM_BASE_URL – OpenAI-compatible base URL (e.g., https://xxxx.proxy.runpod.net/v1)
  • OPENAI_API_KEY – Any non-empty string; vLLM just requires the header
  • MODEL_ID – Model name known to the vLLM server (e.g., ibm-granite/granite-4.0-h-1b)

Copy .env.example to .env and edit those values before starting Streamlit.
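A filled-in .env might look like this (all values are placeholders):

```
VLLM_BASE_URL=https://your-pod-id-8000.proxy.runpod.net/v1
OPENAI_API_KEY=any-non-empty-string
MODEL_ID=ibm-granite/granite-4.0-h-1b
```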

Running

  1. Activate the Conda environment: conda activate private-inference
  2. Optional: pip install -r requirements.txt if dependencies changed
  3. Launch Streamlit: streamlit run app/streamlit_app.py --logger.level info
  4. Watch the PowerShell window for logs confirming each request (HTTP 200 / errors)
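Before launching, you can sanity-check the .env with a few lines of stdlib Python (a sketch; the app itself may load the file differently, e.g. via python-dotenv):

```python
from pathlib import Path

REQUIRED = ("VLLM_BASE_URL", "OPENAI_API_KEY", "MODEL_ID")

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict) -> list:
    """Return the required variables that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

if __name__ == "__main__":
    env = parse_env(Path(".env").read_text())
    problems = missing_keys(env)
    if problems:
        raise SystemExit(f"Missing in .env: {', '.join(problems)}")
    print("All required variables present.")
```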

Deploying vLLM on RunPod

Use the official RunPod vLLM inference template (preloaded with GPU drivers and vLLM):

Steps:

  1. Click the template link and choose an appropriate GPU pod type.
  2. Provide your HUGGING_FACE_HUB_TOKEN as an environment variable so the model can download.
  3. Set container command/args similar to:
    python3 -m vllm.entrypoints.openai.api_server \
      --model ibm-granite/granite-4.0-h-1b \
      --host 0.0.0.0 \
      --port 8000
    
  4. Start the pod and wait for the health indicator to turn green.
  5. Copy the forwarded URL (https://<pod-id>-8000.proxy.runpod.net/v1) into your local .env VLLM_BASE_URL.
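The forwarded URL follows a fixed pattern, so a small helper can derive it from the pod id (a hypothetical helper for illustration, not part of the repo):

```python
def runpod_base_url(pod_id: str, port: int = 8000) -> str:
    """Build the OpenAI-compatible base URL RunPod forwards for a pod."""
    return f"https://{pod_id}-{port}.proxy.runpod.net/v1"

print(runpod_base_url("abc123xy"))  # https://abc123xy-8000.proxy.runpod.net/v1
```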

You can also build/push your own container via the provided Dockerfile if you need custom dependencies, then point RunPod at that image instead of the template default.

About

An inference engine that supports vLLM
