This repository contains the code for the paper "LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents".
- Builds and evaluates conversational agents with private working memory that persists across turns.
- Proposes two agent paradigms with private memory: WorkflowAgent (two-LLM responder/updater) and ReActMemAgent (single-LLM ReAct with tools), plus CoT/stateless variants.
- Compares against external memory baselines: Mem0, A-Mem, LightMem, and MemoryOS.
- Runs batch experiments (Hangman SCT, Diagnosis Simulator SCT), logs full interactions, and evaluates.
- Uses OpenRouter API or vLLM for model inference.
WorkflowAgent is a two-LLM, two-stage agent:
- The responder LLM produces the public reply (optionally with `<think>` reasoning).
- The updater LLM returns strict-JSON tool calls, which are executed to update the private `working_memory`.
- Conversation and memory persist across turns via LangGraph checkpoints.
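For illustration, a strict-JSON tool call emitted by the updater could be parsed like this (the `tool`/`args` field names are assumptions for this sketch, not the repo's exact schema):

```python
import json

# Hypothetical strict-JSON tool call an updater LLM might emit.
# The schema here is illustrative; check the repo's prompts for
# the actual format.
raw = '{"tool": "append_in_memory", "args": {"text": "guessed letters: E, A"}}'

call = json.loads(raw)       # strict JSON: no trailing commas, quoted keys
tool_name = call["tool"]     # which memory tool to execute
tool_args = call["args"]     # keyword arguments for that tool
```

Requiring strict JSON makes the updater's output machine-checkable: a malformed call fails `json.loads` rather than silently corrupting the memory.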
ReActMemAgent is a single-LLM ReAct agent with memory-edit tools:
- The model may emit tool calls inline; the agent executes memory tools sequentially and persists the updated `working_memory`.
- History is pruned per turn to drop within-turn tool chatter while keeping the final AI message.
Both WorkflowAgent and ReActMemAgent can use one of three strategies for updating the private `working_memory`:
- overwrite: `overwrite_memory`
- patch_and_replace: `patch_memory` and/or `replace_in_memory`
- append_and_delete: `append_in_memory` and/or `delete_from_memory`
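As an illustration of the append_and_delete strategy, a string-based working memory could be edited like this (a minimal sketch using the tool names from the list above; the repo's actual signatures and memory representation may differ):

```python
# Sketch only: line-oriented edits on a string-based private memory.
def append_in_memory(memory: str, text: str) -> str:
    """Append a new line of private notes to working memory."""
    return f"{memory}\n{text}" if memory else text

def delete_from_memory(memory: str, text: str) -> str:
    """Remove an exact matching line from working memory, if present."""
    return "\n".join(line for line in memory.split("\n") if line != text)

memory = ""
memory = append_in_memory(memory, "secret word: PLANET")
memory = append_in_memory(memory, "guessed: P, L")
# Replace the stale guess list by deleting it and appending a fresh one.
memory = delete_from_memory(memory, "guessed: P, L")
memory = append_in_memory(memory, "guessed: P, L, A")
# memory == "secret word: PLANET\nguessed: P, L, A"
```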
Two additional variants are included: PrivateCoTAgent (thinking stored privately) and VanillaLLMAgent (plain chat, no memory).
We compare against the following external memory baselines:
| Baseline | Description | Repository |
|---|---|---|
| Mem0 | Universal memory layer for AI Agents | mem0ai/mem0 |
| A-Mem | Agentic Memory for LLM Agents | agiresearch/A-mem |
| LightMem | Lightweight memory system | zjunlp/LightMem |
| MemoryOS | Operating system-inspired memory management | BAI-LAB/MemoryOS |
- `config/config.yaml` declares LLM endpoints, model names, parsing format (think tags vs direct), and generation params.
- Backends supported by `LLMProvider`:
  - `openrouter_sdk`: OpenRouter via the OpenAI SDK; optional reasoning text is wrapped into `<think>` for uniform parsing.
  - `vllm_native`: custom FastAPI server for local deployment with two-pass generation and tool-call support.
  - OpenAI-compatible HTTP clients (default fallback) for other compatible servers.
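A hypothetical `config/config.yaml` fragment along these lines (every key and value below is an assumption for illustration; consult the actual file in the repo):

```yaml
# Illustrative provider entry; real key names may differ.
providers:
  qwen3:
    backend: openrouter_sdk        # or vllm_native / openai-compatible
    model: qwen/qwen3-32b
    parsing_format: think_tags     # split <think>…</think> from the reply
    generation:
      temperature: 0.7
      max_tokens: 2048
```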
- `LLMProvider` parses optional `<think>...</think>` text (when `parsing_format: think_tags`) into private `thinking` vs public `response`.
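Conceptually, the think-tag split can be sketched as follows (a simplified stand-in for the real `LLMProvider` parsing logic, which may differ):

```python
import re

def parse_think_tags(raw: str) -> tuple[str, str]:
    """Split a completion into (private thinking, public response)."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()          # no thinking block: all public
    thinking = match.group(1).strip()
    # Everything outside the tags is the public reply.
    response = (raw[:match.start()] + raw[match.end():]).strip()
    return thinking, response

thinking, response = parse_think_tags(
    "<think>The word has 6 letters; 'E' is common.</think>I guess the letter E."
)
```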
- Deterministic players (`DeterministicHangmanPlayer`, `DeterministicDiagnosisSimulatorPlayer`) simulate interactions for SCT evaluation.
- `SCTController` (for Hangman) and `DiagnosisSCTController` (for Diagnosis Simulator) manage the Self-Consistency Test evaluation flow.
- At the end of a run, evaluators compute SCT metrics and merge results into the JSON logs.
- `SCTEvaluator` computes Self-Consistency Test metrics from trial logs.
- `HybridEvaluator` supports LLM-based judging for behavioral/memory modes, plus rule-based metrics.
- providers: `LLMProvider`, `load_llm_provider` (OpenRouter / native vLLM / OpenAI-compatible)
- agents: `BaseAgent` + variants (`WorkflowAgent`, `ReActMemAgent`, `PrivateCoTAgent`, `VanillaLLMAgent`, `Mem0Agent`, `AMemAgent`, `LightMemAgent`, `MemoryOSAgent`)
- players: deterministic players for SCT evaluation
- games: `HangmanSCTGame`, `DiagnosisSimulatorSCTGame`
- engine: `SCTController`, `DiagnosisSCTController` (turn loop, JSON logs, evaluator)
- evaluation: `SCTEvaluator`, `HybridEvaluator` + prompt registry
- prompts: game- and agent-specific prompt templates
Prerequisites: Python 3.11+ and Poetry installed.
Clone and enter the repo:

```bash
git clone https://github.com/chandar-lab/Hangman.git
cd Hangman
```

Install dependencies (Poetry creates `.venv` and installs from `pyproject.toml`):

```bash
poetry install
```

Activate the virtualenv:

```bash
source ./.venv/bin/activate
```

Set up your OpenRouter API key by creating a `.env` file in the project root:

```
OPENROUTER_API_KEY=your_api_key_here
```

Run Hangman SCT experiments:

```bash
python run_sct_hangman.py --run-config ./config/hangman_sct_qwen3_run.yaml --providers-config ./config/config.yaml
```

Run Diagnosis Simulator SCT experiments:

```bash
python run_sct_ds.py --run-config ./config/diagnosis_simulator_sct_qwen3_run.yaml --providers-config ./config/config.yaml
```

Quick chat loops (with the venv activated):

```bash
python src/hangman/agents/workflow_agent.py
```

or

```bash
python src/hangman/agents/reactmem_agent.py
```

JSON logs are written to `results/<game>/<agent>/...` with:
- `metadata`: game, agent, provider configs, timestamps
- `interaction_log`: `[utterance, private_state]` per turn
- `evaluation`: SCT and evaluator scores per metric
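For reference, a minimal synthetic log with this structure can be inspected like any other JSON file (field names follow the list above; all values below are made up for illustration):

```python
import json

# Synthetic log matching the documented structure; not real results.
log = {
    "metadata": {"game": "hangman", "agent": "WorkflowAgent"},
    "interaction_log": [
        ["I guess E.", "secret word: PLANET"],
        ["I guess A.", "secret word: PLANET; guessed: E"],
    ],
    "evaluation": {"sct_consistency": 1.0},
}

# Round-trip through JSON (as if read from disk) and pull out each
# turn's public utterance, dropping the private state.
parsed = json.loads(json.dumps(log))
utterances = [turn[0] for turn in parsed["interaction_log"]]
```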
To use Mem0-based agents, you need to run a Qdrant vector database:

```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```

If you use this code, please cite our paper:
```bibtex
@misc{baldelli2026llmscantplayhangman,
  title={LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents},
  author={Davide Baldelli and Ali Parviz and Amal Zouaq and Sarath Chandar},
  year={2026},
  eprint={2601.06973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.06973},
}
```