Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations
This repository implements AIR, a benchmark for evaluating how LLM agents learn and adapt through experiential reasoning in multi-episode product recommendation scenarios. AIR challenges agents to improve performance across episodes by learning through natural language interactions rather than through explicit parameter updates.
AIR evaluates an agent's ability to perform in-context experiential reasoning. Specifically, agents must:
- Elicit latent user preferences through strategic questioning
- Navigate evolving product landscapes and user needs
- Leverage cross-episode memory to improve recommendations
- Manage uncertainty in incomplete information environments
- Real-world Products: 71K+ Amazon items across 2K+ categories with rich metadata
- Diverse Personas: 40K+ user profiles with varied, latent preferences and demographics
- LLM User Simulator: Realistic interaction trajectories powered by persona-driven response generation
├── pipeline/ # Core framework
│ ├── core/ # Personas, agents, LLM providers, scoring
│ │ └── llm_providers/ # OpenAI, Claude, Gemini integrations
│ ├── envs/ # Recommendation environment (Gymnasium)
│ └── wrappers/ # Metrics, feedback, logging
├── experiments/ # Experiment orchestration and baselines
├── experiment_runners/ # Configuration and launch scripts
│ └── configs/ # YAML configuration files
├── config/ # Configuration dataclasses (Python code)
├── database/ # Product database, caching, HuggingFace sync
├── database_creation/ # Scripts for categorizing/processing products
├── data/ # Personas, product mappings, trajectories
├── graphing/ # Visualization and analysis tools
├── webpage/ # Interactive leaderboard and submission interface
All experiments use YAML configs with 31 parameters covering:
- Experiment setup (type, episodes, trajectories, seeds)
- Agent parameters (model, temperature, max questions)
- Context modes (raw, summary, planning)
- Feedback types (persona, oracle, LLM-based)
- Checkpointing and resumption
- Interactive trajectory generation
Example: experiment_runners/config_reference.yaml documents all parameters.
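As a rough illustration of how such a config can be inspected outside the runner (the key names below are hypothetical; config_reference.yaml is the authoritative list of the 31 parameters):

# Sketch: reading an experiment config for quick inspection.
# Key names here are illustrative, not the real parameter names.
import yaml

with open("configs/basic_variable_category.yaml") as f:
    config = yaml.safe_load(f)

print(config.get("experiment_type"))   # e.g. "variable_category" (hypothetical key)
print(config.get("num_episodes"))      # episodes per trajectory (hypothetical key)
print(config.get("agent_model"))       # e.g. "gpt-4o" (hypothetical key)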
AIR supports three experimental paradigms to isolate different adaptation challenges:
- variable_category: Fixed persona, varying product categories (preference generalization)
- variable_persona: Fixed category, varying user personas (user adaptation)
- variable_settings: Both persona and category vary (full adaptation)
Planning modes force the agent to give a recommendation after each question within an episode, enabling analysis of within-episode improvement and of whether that improvement accelerates in later episodes.
- planning_no_strat: Unmodified experiment (no planning strategy)
- planning_greedy: Greedy question selection
- planning_dp: Dynamic programming-style lookahead
planning_mode: "planning_dp"
planning_interval: 5

Generate multiple trajectory variants for manual curation:
- System produces N variants of Episode 1
- User selects preferred variant
- System generates N variants of Episode 2 from selected Episode 1
- Repeat until trajectory complete
interactive_mode: true
interactive_variants: 10
interactive_input_file: "episode_01_variant_003.json" # For continuation

Requirements:
- Python 3.9+
- ~1GB disk space (500MB database + dependencies)
- API keys for LLM providers (at least OpenAI and Google for scoring):
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
1. Clone Repository
git clone https://github.com/namkoong-lab/personas.git
cd personas

2. Install Dependencies
pip install -r requirements.txt

3. Configure API Keys
Create a .env file in the project root:
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...

4. Database Setup
The product database is hosted on HuggingFace and will automatically download on first run.
# Just run any experiment - database downloads automatically
cd experiment_runners
python run_experiment.py --config configs/basic_variable_category.yaml

On first run, you'll see:
🔄 Database not found. Downloading from HuggingFace...
📦 Downloading products_part1.parquet (4/4)...
✅ Database setup complete!
# Pre-download database before running experiments
cd database
python setup_database.py

This downloads 4 Parquet files (~500MB total) from HuggingFace and builds a local SQLite database.
The AIR database contains:
- 71,088 products from Amazon with rich metadata
- 2,030 product categories organized into substitute sets
- Product attributes: titles, prices, ratings, descriptions, images
- Score cache: Stores persona-product scores to avoid re-computation
Database Structure:
database/
├── personas.db # SQLite database (auto-generated)
├── setup_database.py # Download script
└── cache/ # Downloaded Parquet files
├── products_part1.parquet
├── products_part2.parquet
├── products_part3.parquet
└── products_part4.parquet
Score Caching: The database includes a persona_scores table that grows during experiments. Cached scores are reused across runs, speeding up repeated experiments with the same personas/categories.
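A quick way to inspect the local database and watch the score cache grow. This is a sketch: persona_scores is the cache table described above, but the products table name is an assumption; check the actual schema with .tables in the sqlite3 shell if it differs.

# Sketch: inspecting personas.db with the standard sqlite3 module.
import sqlite3

conn = sqlite3.connect("database/personas.db")
cur = conn.cursor()
n_products = cur.execute("SELECT COUNT(*) FROM products").fetchone()[0]        # assumed table name
n_cached = cur.execute("SELECT COUNT(*) FROM persona_scores").fetchone()[0]    # cache table from above
print(f"{n_products} products, {n_cached} cached persona-product scores")
conn.close()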
cd experiment_runners
# Run with example config
python run_experiment.py --config configs/basic_variable_category.yaml
# Interactive trajectory building
python run_experiment.py --config configs/interactive_example.yaml

To enable checkpointing and resume an interrupted run:

checkpoint_enabled: true
resume_from_checkpoint: "experiment_results/checkpoint_traj2_ep8.json"

AIR's modular architecture makes it easy to benchmark your own LLM models and agents.
Integrate a new LLM API (e.g., Cohere, Mistral, local models) in 4 steps:
- Copy the template: Use pipeline/core/llm_providers/custom_provider_template.py as a starting point
- Implement two methods:
  - __init__(): Load API key and initialize client
  - chat_completion(): Make API calls with retry logic
- Register your provider in pipeline/core/llm_providers/__init__.py
- Add API key to .env and use it in your config
See custom_provider_template.py for a detailed implementation guide, and openai_provider.py, claude_provider.py, and gemini_provider.py for working examples.
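As a rough sketch of the two required methods (the exact base class, return format, and registration hook are defined in custom_provider_template.py; the HTTP endpoint and environment variable name below are hypothetical):

# Sketch of a custom provider. Treat the details as illustrative only.
import os
import time
import requests

class MyProvider:
    def __init__(self):
        # Load the API key from the environment (populated from .env).
        self.api_key = os.environ["MYPROVIDER_API_KEY"]          # hypothetical env var
        self.base_url = "https://api.myprovider.example/v1/chat"  # hypothetical endpoint

    def chat_completion(self, model, messages, temperature=0.7, max_retries=3):
        # Call the API with simple exponential-backoff retry logic.
        for attempt in range(max_retries):
            try:
                resp = requests.post(
                    self.base_url,
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    json={"model": model, "messages": messages,
                          "temperature": temperature},
                    timeout=60,
                )
                resp.raise_for_status()
                return resp.json()["choices"][0]["message"]["content"]
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)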
Test your provider:
from pipeline.core.llm_providers import chat_completion
response = chat_completion(
model="my-model-v1",
messages=[{"role": "user", "content": "Hello!"}]
)For advanced agent behavior (custom prompting, tool use, RAG), extend UnifiedAgent:
- Create a custom agent in pipeline/core/my_custom_agent.py
- Override methods:
  - decide_action(): Custom decision logic
  - _build_llm_context(): Custom prompt construction
- Add pre/post-processing (tool calls, retrieval, etc.)
- Modify the experiment runner to use your agent class
- Test at small scale before running full experiments
Key extension points:
- decide_action(): Control when to ask vs. recommend
- _build_llm_context(): Customize product/dialog presentation
- _llm_decide_action(): Override core LLM prompting
- Add external knowledge, tools, or multi-step reasoning
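A minimal sketch of such a subclass, assuming UnifiedAgent exposes the methods listed above (the import path, signatures, and return format are assumptions, not the exact framework API):

# Illustrative UnifiedAgent subclass; details are assumptions.
from pipeline.core.unified_agent import UnifiedAgent  # assumed module path

class RetrievalAugmentedAgent(UnifiedAgent):
    def decide_action(self, dialog_history, products):
        # Example policy: always ask clarifying questions for the first three
        # turns, then defer to the default LLM-driven decision.
        if len(dialog_history) < 3:
            return {"action": "ask_question"}
        return super().decide_action(dialog_history, products)

    def _build_llm_context(self, dialog_history, products):
        # Prepend externally retrieved notes (e.g. from a RAG store) to the
        # default prompt context.
        notes = self._retrieve_notes(dialog_history)
        base_context = super()._build_llm_context(dialog_history, products)
        return notes + "\n\n" + base_context

    def _retrieve_notes(self, dialog_history):
        # Placeholder retrieval step; swap in your own knowledge source.
        return "Relevant past feedback: (none yet)"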
@article{yang2025bela,
  title={Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations},
  author={Yang, Gilbert and Chen, Yaqin and Yen, Thomson and Namkoong, Hongseok},
  year={2025}
}

MIT License - see LICENSE for details.