---
title: Medchain Env Environment Server
emoji: 💰
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags: []
---
Train AI agents to keep hospitals stocked: running out of blood costs lives, and ordering too much costs money.
A hospital runs on supplies: surgical gloves, IV bags, blood, insulin, saline; hundreds of products across dozens of wards and operating theatres. Someone has to make sure the right amount of each product is in the right place at the right time. That person is the supply chain manager.
The job sounds simple but has several compounding pressures:
Orders take time to arrive. A supplier might take 2–7 days to deliver. If a ward runs out of IV bags today, you can't just call and get them in an hour. You have to have ordered them days ago, which means constantly forecasting future need.
Supplies expire. Blood expires in 42 days. Platelets in 5 days. Insulin in 28 days. Order too much and it rots on the shelf, wasting both money and the product. Order too little and patients go without.
Demand spikes without warning. A multi-vehicle accident sends 18 trauma patients to the ER at once. Flu season doubles demand for antivirals over two weeks. An elective surgery backlog clears and suddenly the ward needs 35% more consumables. A manager who doesn't read the situation will be caught short.
Emergencies require documentation. In a real hospital, spending extra money to rush an urgent order isn't automatic: it requires a written justification audited by Finance. A manager who writes "routine restock" when they're actually responding to a mass casualty event raises a red flag.
Multiple suppliers, multiple trade-offs. The cheap supplier takes 7 days; the premium supplier delivers in 1 day at 40% higher cost. When the cheap supplier announces delays due to a warehouse strike, you have to pivot, but if you always use the premium supplier you'll blow the budget.
MedChain Env puts an AI agent in the role of a hospital supply chain manager. Each episode runs for 2–12 simulated days. Every day is a shift: the agent gets a limited number of actions to check the situation and respond before the simulation advances.
The agent interacts with a simulated legacy ERP system, the kind of fragmented, text-heavy enterprise software real hospitals actually use. There are three sub-systems to navigate:
- COMMS Pager: the inbox. Unstructured text messages from the incident command system, suppliers, ward managers, and the pharmacy. Critical alerts arrive here: MCI activations, supplier disruptions, lot recalls.
- Inventory DB: query stock levels by ward, product, and expiry date. Returns table-formatted output.
- Procurement Portal: place purchase orders, file justifications for expedited spending, track deliveries.
The agent must decide what to look up (with a limited action budget), interpret what it finds, and act. There is no clean "here is everything you need to know" state dump, just the same messy interface a human manager would use.
Classical inventory management algorithms (like the (s, S) policy: "reorder when the inventory position drops below s, order up to S") work reasonably well when demand is predictable and stable.
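For concreteness, a minimal sketch of such a policy (the function and variable names are illustrative, not code from the environment):

```python
def s_S_order_quantity(stock: int, pipeline: int, s: int, S: int) -> int:
    """Classical (s, S) policy: when the inventory position (on hand
    plus on order) falls below the reorder point s, order up to the
    order-up-to level S; otherwise order nothing."""
    position = stock + pipeline
    return S - position if position < s else 0

# 8 units on hand, nothing inbound, s=20, S=50 -> order 42 units
print(s_S_order_quantity(8, 0, 20, 50))  # 42
```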
They fail completely when the environment sends a message like:
"Incident Command activated. Multi-vehicle accident I-95 northbound. Confirmed 18 critically injured en route. Blood bank placed on AMBER alert."
A classical algorithm sees only numbers. It cannot read that message and reason: "O-negative blood is the universal donor, expiry is 42 days but we only have 8 units left, lead time from the blood bank is 1 day, I need to order 30 units now and file a justification because this is expedited."
An LLM that reads and understands that message can respond correctly.
The heuristic–LLM performance gap in Task 4 (0.38 → 0.58) is the environment's core scientific contribution: a quantified measure of how much contextual language understanding is worth in real operational decisions.
Four tasks of increasing difficulty. Each task is a self-contained episode with its own scenario, products, suppliers, and events.
A single general ward, 3 non-perishable supplies, one reliable supplier with 1-day delivery. Initial stock covers only the first day. The agent's job: read the inbox to understand the situation, check what's in stock, and place at least one replenishment order before supplies run out on Day 2.
This task is purely about exploring the interface. No crises, no expiry pressure, no multi-supplier decisions.
| Score formula | 70% service level + 30% whether at least one order was placed |
| Expected scores | Random agent: 0.55 · Heuristic: 0.88 · LLM: 0.95 |
One ward, 6 products (some with expiry dates), stable demand, 2-day delivery. Initial stock covers 2 days; an agent that waits until Day 2 to order will arrive at Day 3 with empty shelves. The task introduces cost efficiency: placing sensibly-sized orders (not over-ordering) is rewarded alongside not running out.
| Score formula | 50% service level + 50% cost efficiency vs. benchmark |
| Expected scores | Random agent: 0.30 · Heuristic: 0.68 · LLM: 0.82 |
Three wards plus a central pharmacy. Ten products. Two suppliers with different speed/cost trade-offs:
- FastMed: delivers in 1 day, costs 40% more
- MedLine: delivers in 4 days, base price
Two events unfold over the episode, both announced in the inbox before they hit:
Day 2 (early warning) → Days 3–5 (active): Regional influenza alert. Antiviral, mask, and paracetamol demand surges 50% above normal. An agent that pre-orders on Day 2 after reading the warning is protected. An agent that ignores the warning scrambles to catch up.
Days 4–6: MedLine warehouse strike. Standard delivery extends from 4 to 7 days, which means any order placed after Day 4 via MedLine arrives after the episode ends. The agent must pivot to FastMed (at higher cost) for anything urgent.
The task tests whether the agent can act on early warnings rather than reacting to crises after they arrive.
| Score formula | 40% service level + 35% cost efficiency + 15% capacity management + 10% transfer efficiency |
| Expected scores | Random: 0.22 · Heuristic (ignores alerts): 0.55 · LLM (reads alerts): 0.73 |
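The MedLine cutoff in this task is pure lead-time arithmetic. A sketch (the episode length used below is illustrative, not taken from the task spec):

```python
def arrival_day(order_day: int, lead_time_days: int) -> int:
    """An order placed on order_day lands lead_time_days later."""
    return order_day + lead_time_days

def arrives_in_time(order_day: int, lead_time_days: int, last_day: int) -> bool:
    """The order only helps if it lands on or before the final day."""
    return arrival_day(order_day, lead_time_days) <= last_day

# With MedLine's strike lead time of 7 days, an order on Day 5 of a
# hypothetical 8-day episode lands on Day 12: too late.
print(arrives_in_time(5, 7, 8))  # False
print(arrives_in_time(5, 1, 8))  # True (FastMed's 1-day delivery)
```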
A full regional network: 3 hospitals plus a regional distribution centre, 15 products including life-critical perishables: O-negative blood (universal donor, expires in 42 days), platelets (expires in 5 days), fresh frozen plasma. Budget ceiling: $150,000 outstanding at any time.
Five crisis events unfold across the 12-day episode β some overlapping:
| Day | Event |
|---|---|
| 3 | Cold chain breach at the regional DC: refrigeration failure destroys all platelet inventory. An alert fires. The agent must order replacements immediately or hospitals run out within days. |
| 6–14 | Supplier force majeure: HealthCo Supplies lead time extends from 3 to 7 days due to flu absenteeism. The agent must switch to the premium express supplier for urgent items. |
| 8 (warning) → 9–11 (active) | Mass casualty incident: large multi-vehicle accident. Blood product demand triples across all hospitals for 3 days. An agent that reads the Day 8 warning and pre-orders blood survives this; one that waits until Day 9 faces critical stockouts. |
| 11 | Mandatory product recall: a specific lot of IV Saline is flagged by the health authority. The agent must find which locations hold that lot, quarantine all units, and order replacements. Failing to quarantine by end of shift is a patient safety failure. |
This task also introduces the paper trail mechanic: any expedited (rush) order triggers a mandatory written justification that Finance reviews. If the agent writes "routine restock" when there is an active mass casualty event, the justification is flagged as incoherent and a score penalty applies.
| Score formula | 35% service level + 25% cost efficiency + 20% critical product availability + 15% waste reduction (expired product value) + 5% crisis response speed |
| Justification penalty | −0.05 per incoherent expedited justification (max −0.15) |
| Expected scores | Random: 0.12 · Heuristic (no alert reading): 0.38 · LLM: 0.58 · Near-optimal: 0.82 |
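Since scoring is deterministic (no LLM judge), the coherence check is presumably rule-based. A hypothetical sketch of how such a check and its capped penalty could work; the real keyword logic is not documented here:

```python
def justification_penalty(justifications: list[tuple[str, bool]]) -> float:
    """Each entry is (text, crisis_active). Claiming a routine reason
    while a crisis event is active counts as incoherent.
    Penalty: -0.05 per incoherent justification, capped at -0.15."""
    flags = sum(
        1 for text, crisis_active in justifications
        if crisis_active and "routine" in text.lower()
    )
    return max(-0.05 * flags, -0.15)

print(justification_penalty([("Routine restock", True)]))  # -0.05
```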
Each task produces a score between 0.0 and 1.0 computed deterministically from the simulation history. There is no LLM judge.
Small reward signals after each action reinforce useful behaviour:
- Reading the inbox or checking inventory (first time each shift): +0.01
- Successfully placing an order: +0.02
- Executing a transfer or quarantine: +0.01
- Filing a coherent justification: +0.01
- Filing an incoherent justification (flagged by Finance): −0.05
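The shaping values above can be tallied like this (the event labels are mine; the values come from the list):

```python
STEP_REWARDS = {
    "read_inbox_first": 0.01,       # first inbox read each shift
    "check_inventory_first": 0.01,  # first inventory check each shift
    "order_placed": 0.02,
    "transfer_or_quarantine": 0.01,
    "coherent_justification": 0.01,
    "incoherent_justification": -0.05,
}

def shaping_total(events: list[str]) -> float:
    """Sum the small per-action shaping rewards for logged events."""
    return round(sum(STEP_REWARDS[e] for e in events), 4)

print(shaping_total(["read_inbox_first", "check_inventory_first", "order_placed"]))  # 0.04
```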
The main score is computed when the episode completes. Components vary by task:
- Service level: what fraction of demand was actually fulfilled across all products, locations, and days
- Cost efficiency: how close actual spend was to a reasonable benchmark (spending less than the benchmark is good; spending far over it penalises over-ordering)
- Critical product availability (Task 4): blood and platelets tracked separately; running out of these carries a heavy penalty
- Waste fraction (Task 4): value of expired inventory divided by total spend; rewards active expiry management
- Crisis response score (Task 4): how quickly the agent positioned blood during the MCI window and handled the recall
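As an illustration of how these components combine, the Task 4 formula as a weighted sum (the component values below are made up, and clamping the final score to [0, 1] is my assumption):

```python
def task4_score(service, cost, critical, waste_reduction, crisis,
                justification_penalty=0.0):
    """35% service + 25% cost efficiency + 20% critical availability
    + 15% waste reduction + 5% crisis response, plus any (negative)
    justification penalty, clamped to [0, 1]."""
    raw = (0.35 * service + 0.25 * cost + 0.20 * critical
           + 0.15 * waste_reduction + 0.05 * crisis) + justification_penalty
    return min(max(raw, 0.0), 1.0)

print(round(task4_score(0.9, 0.7, 0.8, 0.6, 1.0), 2))  # 0.79
```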
The agent has 9 tools, each consuming one action from the shift budget:
| Tool | What it does |
|---|---|
| `read_inbox` | Read messages from the COMMS pager (filter: unread / all / flagged) |
| `query_erp` | Query stock levels, expiry dates, pipeline orders, or demand history |
| `query_supplier` | Get a supplier's current lead time and any disruption notices |
| `query_forecast` | Request a demand forecast for a product at a location |
| `submit_po` | Place a purchase order (standard or expedited) |
| `transfer` | Move stock from one location to another |
| `quarantine_lot` | Isolate a specific inventory lot (for recalls or cold chain failures) |
| `file_justification` | Write the Finance audit reason for an expedited order |
| `end_shift` | Close the shift; simulation advances by one day |
The action budget is the core constraint. Each shift, the agent gets 5–10 actions depending on the task. Spending the whole budget querying every product at every location leaves nothing for placing orders. The agent must triage: what is most urgent to check right now?
The inference script runs a multi-turn LLM agent using the OpenAI API format. Any model accessible through an OpenAI-compatible endpoint works.
Each shift is one LLM conversation:
- The agent sees the shift dashboard (what day it is, how many actions remain, inbox alerts count)
- The LLM picks a tool to call
- The tool result is appended to the conversation
- Loop repeats until the agent calls `end_shift()` or runs out of action budget
- At `end_shift()`, the conversation is compressed into a short summary before the next shift begins
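The loop above can be sketched API-agnostically. Both callables are injected, so this is a skeleton under my own conventions, not the actual inference.py:

```python
def run_shift(call_llm, execute_tool, budget: int):
    """One shift as a multi-turn conversation. call_llm(messages) returns
    the next tool call as (name, args); execute_tool runs that tool
    against the environment and returns its text result. The shift ends
    when the agent calls end_shift or the action budget runs out."""
    messages = []
    for _ in range(budget):
        name, args = call_llm(messages)
        if name == "end_shift":
            return messages, "end_shift"
        # Append the tool result so the next LLM turn can see it
        messages.append({"role": "tool", "name": name,
                         "content": execute_tool(name, args)})
    return messages, "budget_exhausted"
```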
Context window management: After each shift, the history is pruned down to the system prompt + all past shift summaries + the last 6 messages from the current shift. This keeps context bounded regardless of how many days the episode runs.
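A minimal sketch of that pruning rule (message shapes assumed to follow the OpenAI chat format; the summary role is my choice):

```python
def prune_history(system_prompt, shift_summaries, current_shift, keep_last=6):
    """Rebuild the conversation as: system prompt, every past shift
    summary, then only the last keep_last messages of the current shift.
    Context stays bounded however many days the episode runs."""
    return ([{"role": "system", "content": system_prompt}]
            + [{"role": "user", "content": s} for s in shift_summaries]
            + current_shift[-keep_last:])
```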
System prompt guidance: The agent is told to prioritise: read inbox → check inventory → place orders → end shift. It's given explicit guidance on lead-time arithmetic, expiry rotation, how to respond to MCI alerts, and when expedited orders are warranted.
- Docker
- Python 3.10+ with `uv`

Build the server image:

```shell
docker build -t <LOCAL_IMAGE_NAME> -f server/Dockerfile .
```

Then configure the endpoint and run inference:

```shell
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=openai/gpt-oss-120b:groq
export HF_TOKEN=<your_token>
export LOCAL_IMAGE_NAME=nik-55_medchain-openenv
uv run python inference.py
```

The script runs all 4 tasks in sequence and emits structured logs to stdout:
```
[START] task=orientation_ward env=medchain model=openai/gpt-oss-120b:groq
[STEP] step=1 action=read_inbox({}) reward=0.01 done=false error=null
[STEP] step=2 action=query_erp({...}) reward=0.01 done=false error=null
...
[END] success=true steps=10 score=0.923 rewards=0.01,0.01,0.02,...
```
| Variable | Description |
|---|---|
| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | Model identifier |
| `HF_TOKEN` or `API_KEY` | API authentication key |
| `LOCAL_IMAGE_NAME` | Docker image tag used by `inference.py` to launch containers |
| `SLEEP_BETWEEN_STEPS` | Seconds between LLM calls (default 2) |
| `LOG_LEVEL` | `INFO` (default) or `DEBUG` (writes timestamped log to `logs/`) |
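A sketch of how these variables might be read at startup (defaults taken from the table; the helper itself is hypothetical, not the actual inference.py code):

```python
import os

def load_config() -> dict:
    """Read runtime configuration from the environment variables above."""
    return {
        "api_base_url": os.environ["API_BASE_URL"],
        "model_name": os.environ["MODEL_NAME"],
        # Either HF_TOKEN or API_KEY may carry the key
        "api_key": os.environ.get("HF_TOKEN") or os.environ.get("API_KEY"),
        "local_image": os.environ["LOCAL_IMAGE_NAME"],
        "sleep_between_steps": float(os.environ.get("SLEEP_BETWEEN_STEPS", "2")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
```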
```
medchain_env/
├── inference.py      # Entry point: LLM agent runs all 4 tasks
├── client.py         # MedchainEnv OpenEnv client
├── models.py         # State and observation types
├── openenv.yaml      # OpenEnv manifest
├── pyproject.toml    # Dependencies
└── server/
    ├── app.py                       # FastAPI application
    ├── Dockerfile                   # Container build
    ├── medchain_env_environment.py  # OpenEnv Environment + 9 MCP tools
    ├── simulation.py                # Simulation engine (inventory, demand, events)
    ├── tasks.py                     # Task configurations
    ├── grader.py                    # Terminal reward computation
    └── erp_formatter.py             # ERP text output formatters
```