
nik-55/openenv-hackathon


---
title: Medchain Env Environment Server
emoji: 🎰
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

MedChain Env — Hospital Supply Chain Management

Train AI agents to keep hospitals stocked — where running out of blood costs lives and ordering too much costs money.


What is hospital supply chain management?

A hospital runs on supplies: surgical gloves, IV bags, blood, insulin, saline — hundreds of products across dozens of wards and operating theatres. Someone has to make sure the right amount of each product is in the right place at the right time. That person is the supply chain manager.

The job sounds simple but has several compounding pressures:

Orders take time to arrive. A supplier might take 2–7 days to deliver. If a ward runs out of IV bags today, you can't just call and get them in an hour. You have to have ordered them days ago — which means constantly forecasting future need.

Supplies expire. Blood expires in 42 days. Platelets in 5 days. Insulin in 28 days. Order too much and it rots on the shelf, wasting both money and the product. Order too little and patients go without.

Demand spikes without warning. A multi-vehicle accident sends 18 trauma patients to the ER at once. Flu season doubles demand for antivirals over two weeks. An elective surgery backlog clears and suddenly the ward needs 35% more consumables. A manager who doesn't read the situation will be caught short.

Emergencies require documentation. In a real hospital, spending extra money to rush an urgent order isn't automatic — it requires a written justification audited by Finance. A manager who writes "routine restock" when they're actually responding to a mass casualty event raises a red flag.

Multiple suppliers, multiple trade-offs. The cheap supplier takes 7 days; the premium supplier delivers in 1 day at 40% higher cost. When the cheap supplier announces delays due to a warehouse strike, you have to pivot — but if you always use the premium supplier you'll blow the budget.


What is this environment?

MedChain Env puts an AI agent in the role of a hospital supply chain manager. Each episode runs for 2–12 simulated days. Every day is a shift: the agent gets a limited number of actions to check the situation and respond before the simulation advances.

The agent interacts with a simulated legacy ERP system — the kind of fragmented, text-heavy enterprise software real hospitals actually use. There are three sub-systems to navigate:

  • COMMS Pager — the inbox. Unstructured text messages from the incident command system, suppliers, ward managers, and the pharmacy. Critical alerts arrive here: MCI activations, supplier disruptions, lot recalls.
  • Inventory DB — query stock levels by ward, product, and expiry date. Returns table-formatted output.
  • Procurement Portal — place purchase orders, file justifications for expedited spending, track deliveries.

The agent must decide what to look up (with a limited action budget), interpret what it finds, and act. There is no clean "here is everything you need to know" state dump — just the same messy interface a human manager would use.


Why does this matter for AI research?

Classical inventory management algorithms (like the (s, S) policy — "reorder when stock drops below s, order up to S") work reasonably well when demand is predictable and stable.
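The (s, S) rule is small enough to write out in full. A minimal sketch in Python (names and numbers are illustrative, not taken from this codebase):

```python
def s_S_order_quantity(on_hand: int, on_order: int, s: int, S: int) -> int:
    """Classical (s, S) policy: if the inventory position falls below the
    reorder point s, order enough to bring it back up to the order-up-to
    level S; otherwise order nothing."""
    position = on_hand + on_order  # stock on the shelf plus stock in transit
    if position < s:
        return S - position
    return 0

# With 8 units on hand, nothing in transit, s=20, S=50: order 42 units.
print(s_S_order_quantity(8, 0, s=20, S=50))  # 42
```

Note that the rule looks only at quantities; no message, alert, or context ever enters the decision.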

They fail completely when the environment sends a message like:

"Incident Command activated. Multi-vehicle accident I-95 northbound. Confirmed 18 critically injured en route. Blood bank placed on AMBER alert."

A classical algorithm sees only numbers. It cannot read that message and reason: "O-negative blood is the universal donor, expiry is 42 days but we only have 8 units left, lead time from the blood bank is 1 day, I need to order 30 units now and file a justification because this is expedited."

An LLM that reads and understands that message can respond correctly.

The heuristic→LLM performance gap in Task 4 (0.38 → 0.58) is the environment's core scientific contribution — a quantified measure of how much contextual language understanding is worth in real operational decisions.


Tasks

Four tasks of increasing difficulty. Each task is a self-contained episode with its own scenario, products, suppliers, and events.

Task 1 — orientation_ward (Easy, 2 days)

A single general ward, 3 non-perishable supplies, one reliable supplier with 1-day delivery. Initial stock covers only the first day. The agent's job: read the inbox to understand the situation, check what's in stock, and place at least one replenishment order before supplies run out on Day 2.

This task is purely about exploring the interface. No crises, no expiry pressure, no multi-supplier decisions.

Score formula: 70% service level + 30% whether at least one order was placed
Expected scores: Random agent: 0.55 · Heuristic: 0.88 · LLM: 0.95

Task 2 — single_ward_stable (Medium, 3 days)

One ward, 6 products (some with expiry dates), stable demand, 2-day delivery. Initial stock covers 2 days — an agent that waits until Day 2 to order will arrive at Day 3 with empty shelves. The task introduces cost efficiency: placing sensibly-sized orders (not over-ordering) is rewarded alongside not running out.

Score formula: 50% service level + 50% cost efficiency vs. benchmark
Expected scores: Random agent: 0.30 · Heuristic: 0.68 · LLM: 0.82

Task 3 — multi_ward_seasonal (Medium-Hard, 6 days)

Three wards plus a central pharmacy. Ten products. Two suppliers with different speed/cost trade-offs:

  • FastMed: delivers in 1 day, costs 40% more
  • MedLine: delivers in 4 days, base price

Two events unfold over the episode — both announced in the inbox before they hit:

Day 2 (early warning) → Day 3–5 (active): Regional influenza alert. Antiviral, mask, and paracetamol demand surges 50% above normal. An agent that pre-orders on Day 2 after reading the warning is protected. An agent that ignores the warning scrambles to catch up.

Day 4–6: MedLine warehouse strike. Standard delivery extends from 4 to 7 days — which means any order placed after Day 4 via MedLine arrives after the episode ends. The agent must pivot to FastMed (at higher cost) for anything urgent.
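The arithmetic behind that pivot is worth spelling out. A quick check in Python, using the lead times stated above (the helper name is hypothetical):

```python
EPISODE_END = 6  # Task 3 runs for 6 simulated days

def arrival_day(order_day: int, lead_time_days: int) -> int:
    """Day a delivery lands: the day the order is placed plus the lead time."""
    return order_day + lead_time_days

# MedLine during the strike: 7-day lead time. An order on Day 5 lands on
# Day 12, well past the end of the episode, so the spend is wasted.
print(arrival_day(5, 7))  # 12
# FastMed: 1-day lead time at 40% higher cost. The same order lands on Day 6.
print(arrival_day(5, 1))  # 6
```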

The task tests whether the agent can act on early warnings rather than reacting to crises after they arrive.

Score formula: 40% service level + 35% cost efficiency + 15% capacity management + 10% transfer efficiency
Expected scores: Random: 0.22 · Heuristic (ignores alerts): 0.55 · LLM (reads alerts): 0.73

Task 4 — hospital_network_crisis (Hard, 12 days)

A full regional network: 3 hospitals plus a regional distribution centre, 15 products including life-critical perishables: O-negative blood (universal donor, expires in 42 days), platelets (expires in 5 days), fresh frozen plasma. Budget ceiling: $150,000 outstanding at any time.

Five crisis events unfold across the 12-day episode — some overlapping:

Day 3: Cold chain breach — refrigeration failure at the regional DC destroys all platelet inventory. An alert fires. The agent must order replacements immediately or hospitals run out within days.
Days 6–14: Supplier force majeure — HealthCo Supplies lead time extends from 3 to 7 days due to flu absenteeism. The agent must switch to the premium express supplier for urgent items.
Day 8 (warning) → Days 9–11 (active): Mass casualty incident — a large multi-vehicle accident. Blood product demand triples across all hospitals for 3 days. An agent that reads the Day 8 warning and pre-orders blood survives this; one that waits until Day 9 faces critical stockouts.
Day 11: Mandatory product recall — a specific lot of IV Saline is flagged by the health authority. The agent must find which locations hold that lot, quarantine all units, and order replacements. Failing to quarantine by end of shift is a patient safety failure.

This task also introduces the paper trail mechanic: any expedited (rush) order triggers a mandatory written justification that Finance reviews. If the agent writes "routine restock" when there is an active mass casualty event, the justification is flagged as incoherent and a score penalty applies.

Score formula: 35% service level + 25% cost efficiency + 20% critical product availability + 15% waste reduction (expired product value) + 5% crisis response speed
Justification penalty: −0.05 per incoherent expedited justification (max −0.15)
Expected scores: Random: 0.12 · Heuristic (no alert reading): 0.38 · LLM: 0.58 · Near-optimal: 0.82

How scoring works

Each task produces a score between 0.0 and 1.0 computed deterministically from the simulation history. There is no LLM judge.

Partial credit during the episode

Small reward signals after each action reinforce useful behaviour:

  • Reading the inbox or checking inventory (first time each shift): +0.01
  • Successfully placing an order: +0.02
  • Executing a transfer or quarantine: +0.01
  • Filing a coherent justification: +0.01
  • Filing an incoherent justification (flagged by Finance): −0.05
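Conceptually this is just an event-to-reward lookup. A sketch (the event labels are illustrative; the real logic lives in the server code):

```python
# Hypothetical mirror of the partial-credit table above.
STEP_REWARDS = {
    "read_inbox_first": 0.01,          # first inbox read each shift
    "check_inventory_first": 0.01,     # first inventory check each shift
    "place_order": 0.02,               # successfully placed purchase order
    "transfer_or_quarantine": 0.01,    # executed transfer or quarantine
    "coherent_justification": 0.01,    # Finance accepts the written reason
    "incoherent_justification": -0.05, # Finance flags the written reason
}

def step_reward(event: str) -> float:
    """Per-action partial credit; unknown events earn nothing."""
    return STEP_REWARDS.get(event, 0.0)

print(step_reward("place_order"))               # 0.02
print(step_reward("incoherent_justification"))  # -0.05
```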

Terminal score at episode end

The main score is computed when the episode completes. Components vary by task:

  • Service level — what fraction of demand was actually fulfilled across all products, locations, and days
  • Cost efficiency — how close actual spend was to a reasonable benchmark (spending less than the benchmark is good; spending far over it penalises over-ordering)
  • Critical product availability (Task 4) — blood and platelets tracked separately; running out of these carries a heavy penalty
  • Waste fraction (Task 4) — value of expired inventory divided by total spend; rewards active expiry management
  • Crisis response score (Task 4) — how quickly the agent positioned blood during the MCI window and handled the recall
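As a worked example, the Task 4 terminal score combines these components with the weights listed in the task description, minus the capped justification penalty. A sketch (the function and component names are illustrative, not taken from grader.py):

```python
def task4_score(service, cost_eff, critical_avail, waste_red, crisis_speed,
                incoherent_justifications=0):
    """Weighted sum of the Task 4 components (each in [0, 1]), minus
    0.05 per incoherent expedited justification, capped at 0.15 total."""
    base = (0.35 * service
            + 0.25 * cost_eff
            + 0.20 * critical_avail
            + 0.15 * waste_red
            + 0.05 * crisis_speed)
    penalty = min(0.05 * incoherent_justifications, 0.15)
    return max(0.0, base - penalty)

# A middling run with one flagged justification:
# 0.245 + 0.125 + 0.12 + 0.12 + 0.02 = 0.63, minus 0.05 penalty = 0.58
print(round(task4_score(0.7, 0.5, 0.6, 0.8, 0.4, incoherent_justifications=1), 3))
```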

How the agent interacts — the 9 tools

The agent has 9 tools, each consuming one action from the shift budget:

  • read_inbox — Read messages from the COMMS pager (filter: unread / all / flagged)
  • query_erp — Query stock levels, expiry dates, pipeline orders, or demand history
  • query_supplier — Get a supplier's current lead time and any disruption notices
  • query_forecast — Request a demand forecast for a product at a location
  • submit_po — Place a purchase order (standard or expedited)
  • transfer — Move stock from one location to another
  • quarantine_lot — Isolate a specific inventory lot (for recalls or cold chain failures)
  • file_justification — Write the Finance audit reason for an expedited order
  • end_shift — Close the shift; the simulation advances by one day

The action budget is the core constraint. Each shift, the agent gets 5–10 actions depending on the task. Using all 10 actions to query every product at every location leaves no budget to place orders. The agent must triage: what is most urgent to check right now?


How the LLM agent works (inference.py)

The inference script runs a multi-turn LLM agent using the OpenAI API format. Any model accessible through an OpenAI-compatible endpoint works.

Each shift is one LLM conversation:

  1. The agent sees the shift dashboard (what day it is, how many actions remain, inbox alerts count)
  2. The LLM picks a tool to call
  3. The tool result is appended to the conversation
  4. Loop repeats until the agent calls end_shift() or runs out of action budget
  5. At end_shift(), the conversation is compressed into a short summary before the next shift begins

Context window management: After each shift, the history is pruned down to the system prompt + all past shift summaries + the last 6 messages from the current shift. This keeps context bounded regardless of how many days the episode runs.
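A minimal version of that pruning step might look like this (the message structure is an assumption, not lifted from inference.py):

```python
def prune_history(system_prompt, shift_summaries, current_shift_messages,
                  keep_last=6):
    """Rebuild the conversation as: the system prompt, one message per past
    shift summary, and only the last `keep_last` messages of the current
    shift. Context stays bounded no matter how many days the episode runs."""
    history = [{"role": "system", "content": system_prompt}]
    history += [{"role": "user", "content": s} for s in shift_summaries]
    history += current_shift_messages[-keep_last:]
    return history

# A 10-message shift collapses to the last 6, plus prompt and one summary.
msgs = [{"role": "assistant", "content": f"step {i}"} for i in range(10)]
pruned = prune_history("You are the supply manager.", ["Day 1: restocked gloves."], msgs)
print(len(pruned))  # 8  (1 system + 1 summary + last 6 messages)
```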

System prompt guidance: The agent is told to prioritise: read inbox → check inventory → place orders → end shift. It's given explicit guidance on lead time arithmetic, expiry rotation, how to respond to MCI alerts, and when expedited orders are warranted.
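Expiry rotation here means issuing stock first-expired-first-out (FEFO), so short-dated lots are consumed before they expire. A sketch of the idea (the lot representation is illustrative, not the environment's actual data model):

```python
def issue_fefo(lots, quantity):
    """Consume `quantity` units from inventory, taking the lots closest to
    expiry first. Each lot is (expiry_day, units); returns the lots left."""
    remaining = []
    for expiry_day, units in sorted(lots):  # soonest expiry first
        take = min(units, quantity)
        quantity -= take
        if units - take > 0:
            remaining.append((expiry_day, units - take))
    return remaining

# Two platelet lots, expiring on day 5 and day 42; issue 10 units.
# The day-5 lot (8 units) is drained first, then 2 units of the day-42 lot.
print(issue_fefo([(42, 20), (5, 8)], 10))  # [(42, 18)]
```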


Setup & Running

Prerequisites

  • Docker
  • Python 3.10+ with uv

Build the Docker image

docker build -t <LOCAL_IMAGE_NAME> -f server/Dockerfile .

Set environment variables and run

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=openai/gpt-oss-120b:groq
export HF_TOKEN=<your_token>
export LOCAL_IMAGE_NAME=nik-55_medchain-openenv

uv run python inference.py

The script runs all 4 tasks in sequence and emits structured logs to stdout:

[START] task=orientation_ward env=medchain model=openai/gpt-oss-120b:groq
[STEP]  step=1 action=read_inbox({}) reward=0.01 done=false error=null
[STEP]  step=2 action=query_erp({...}) reward=0.01 done=false error=null
...
[END]   success=true steps=10 score=0.923 rewards=0.01,0.01,0.02,...

Environment variables

  • API_BASE_URL — LLM API endpoint (OpenAI-compatible)
  • MODEL_NAME — Model identifier
  • HF_TOKEN or API_KEY — API authentication key
  • LOCAL_IMAGE_NAME — Docker image tag used by inference.py to launch containers
  • SLEEP_BETWEEN_STEPS — Seconds between LLM calls (default 2)
  • LOG_LEVEL — INFO (default) or DEBUG (writes a timestamped log to logs/)

Project Structure

medchain_env/
├── inference.py                     # Entry point — LLM agent runs all 4 tasks
├── client.py                        # MedchainEnv OpenEnv client
├── models.py                        # State and observation types
├── openenv.yaml                     # OpenEnv manifest
├── pyproject.toml                   # Dependencies
└── server/
    ├── app.py                       # FastAPI application
    ├── Dockerfile                   # Container build
    ├── medchain_env_environment.py  # OpenEnv Environment + 9 MCP tools
    ├── simulation.py                # Simulation engine (inventory, demand, events)
    ├── tasks.py                     # Task configurations
    ├── grader.py                    # Terminal reward computation
    └── erp_formatter.py             # ERP text output formatters

About

Submission for the hackathon: https://www.scaler.com/school-of-technology/meta-pytorch-hackathon
