Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations
This repository implements AIR, a benchmark for evaluating how LLM agents learn and adapt through experiential reasoning in multi-episode product recommendation scenarios. AIR challenges agents to improve performance across episodes by learning through natural language interactions rather than through explicit parameter updates.
AIR evaluates an agent's ability to perform in-context experiential reasoning. Specifically, agents must:
- Elicit latent user preferences through strategic questioning
- Navigate evolving product landscapes and user needs
- Leverage cross-episode memory to improve recommendations
- Manage uncertainty in incomplete information environments
- Real-world Products: 71K+ Amazon items across 2K+ categories with rich metadata
- Diverse Personas: 40K+ user profiles with varied, latent preferences and demographics
- LLM User Simulator: Realistic interaction trajectories powered by persona-driven response generation
├── pipeline/ # Core framework
│ ├── core/ # Personas, agents, LLM providers, scoring
│ │ └── llm_providers/ # OpenAI, Claude, Gemini integrations
│ ├── envs/ # Recommendation environment (Gymnasium)
│ └── wrappers/ # Metrics, feedback, logging
├── experiments/ # Experiment orchestration and baselines
├── experiment_runners/ # Configuration and launch scripts
│ └── configs/ # YAML configuration files
├── config/ # Configuration dataclasses (Python code)
├── database/ # Product database, caching, HuggingFace sync
├── database_creation/ # Scripts for categorizing/processing products
├── data/ # Personas, product mappings, trajectories
├── graphing/ # Visualization and analysis tools
├── webpage/ # Interactive leaderboard and submission interface
All experiments use YAML configs with 31 parameters covering:
- Experiment setup (type, episodes, trajectories, seeds)
- Agent parameters (model, temperature, max questions)
- Context modes (raw, summary, planning)
- Feedback types (persona, oracle, LLM-based)
- Checkpointing and resumption
- Interactive trajectory generation
Example: experiment_runners/config_reference.yaml documents all parameters.
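As a rough illustration of how such a config can be inspected outside the runner (the key names below are hypothetical; config_reference.yaml is the authoritative list of the 31 parameters):

# Sketch: reading an experiment config for quick inspection.
# Key names here are illustrative, not the real parameter names.
import yaml

with open("configs/basic_variable_category.yaml") as f:
    config = yaml.safe_load(f)

print(config.get("experiment_type"))   # e.g. "variable_category" (hypothetical key)
print(config.get("num_episodes"))      # episodes per trajectory (hypothetical key)
print(config.get("agent_model"))       # e.g. "gpt-4o" (hypothetical key)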
AIR supports three experimental paradigms to isolate different adaptation challenges:
- variable_category: Fixed persona, varying product categories (preference generalization)
- variable_persona: Fixed category, varying user personas (user adaptation)
- variable_settings: Both persona and category vary (full adaptation)
Planning modes force the agent to give a recommendation after each question within an episode, enabling analysis of within-episode improvement and of whether that improvement accelerates in later episodes.
- planning_no_strat: Unmodified experiment (no planning strategy)
- planning_greedy: Greedy question selection
- planning_dp: Dynamic programming-style lookahead
planning_mode: "planning_dp"
planning_interval: 5

Generate multiple trajectory variants for manual curation:
- System produces N variants of Episode 1
- User selects preferred variant
- System generates N variants of Episode 2 from selected Episode 1
- Repeat until trajectory complete
interactive_mode: true
interactive_variants: 10
interactive_input_file: "episode_01_variant_003.json" # For continuation

Requirements:
- Python 3.9+
- ~1GB disk space (500MB database + dependencies)
- API keys for LLM providers (at least OpenAI and Google for scoring):
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
1. Clone Repository
git clone https://github.com/namkoong-lab/personas.git
cd personas

2. Install Dependencies
pip install -r requirements.txt

3. Configure API Keys
Create a .env file in the project root:
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...

4. Database Setup
The product database is hosted on HuggingFace and will automatically download on first run.
# Just run any experiment - database downloads automatically
cd experiment_runners
python run_experiment.py --config configs/basic_variable_category.yaml

On first run, you'll see:
🔄 Database not found. Downloading from HuggingFace...
📦 Downloading products_part1.parquet (4/4)...
✅ Database setup complete!
# Pre-download database before running experiments
cd database
python setup_database.py

This downloads 4 Parquet files (~500MB total) from HuggingFace and builds a local SQLite database.
The AIR database contains:
- 71,088 products from Amazon with rich metadata
- 2,030 product categories organized into substitute sets
- Product attributes: titles, prices, ratings, descriptions, images
- Score cache: Stores persona-product scores to avoid re-computation
Database Structure:
database/
├── personas.db # SQLite database (auto-generated)
├── setup_database.py # Download script
└── cache/ # Downloaded Parquet files
├── products_part1.parquet
├── products_part2.parquet
├── products_part3.parquet
└── products_part4.parquet
Score Caching: The database includes a persona_scores table that grows during experiments. Cached scores are reused across runs, speeding up repeated experiments with the same personas/categories.
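A quick way to inspect the local database and watch the score cache grow. This is a sketch: persona_scores is the cache table described above, but the products table name is an assumption; check the actual schema with .tables in the sqlite3 shell if it differs.

# Sketch: inspecting personas.db with the standard sqlite3 module.
import sqlite3

conn = sqlite3.connect("database/personas.db")
cur = conn.cursor()
n_products = cur.execute("SELECT COUNT(*) FROM products").fetchone()[0]        # assumed table name
n_cached = cur.execute("SELECT COUNT(*) FROM persona_scores").fetchone()[0]    # cache table from above
print(f"{n_products} products, {n_cached} cached persona-product scores")
conn.close()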
cd experiment_runners
# Run with example config
python run_experiment.py --config configs/basic_variable_category.yaml
# Interactive trajectory building
python run_experiment.py --config configs/interactive_example.yaml

To enable checkpointing and resume an interrupted run:

checkpoint_enabled: true
resume_from_checkpoint: "experiment_results/checkpoint_traj2_ep8.json"

AIR's modular architecture makes it easy to benchmark your own LLM models and agents.
Integrate a new LLM API (e.g., Cohere, Mistral, local models) in 4 steps:
- Copy the template: Use pipeline/core/llm_providers/custom_provider_template.py as a starting point
- Implement two methods:
  - __init__(): Load API key and initialize client
  - chat_completion(): Make API calls with retry logic
- Register your provider in pipeline/core/llm_providers/__init__.py
- Add API key to .env and use it in your config
See custom_provider_template.py for a detailed implementation guide, and openai_provider.py, claude_provider.py, and gemini_provider.py for working examples.
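As a rough sketch of the two required methods (the exact base class, return format, and registration hook are defined in custom_provider_template.py; the HTTP endpoint and environment variable name below are hypothetical):

# Sketch of a custom provider. Treat the details as illustrative only.
import os
import time
import requests

class MyProvider:
    def __init__(self):
        # Load the API key from the environment (populated from .env).
        self.api_key = os.environ["MYPROVIDER_API_KEY"]          # hypothetical env var
        self.base_url = "https://api.myprovider.example/v1/chat"  # hypothetical endpoint

    def chat_completion(self, model, messages, temperature=0.7, max_retries=3):
        # Call the API with simple exponential-backoff retry logic.
        for attempt in range(max_retries):
            try:
                resp = requests.post(
                    self.base_url,
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    json={"model": model, "messages": messages,
                          "temperature": temperature},
                    timeout=60,
                )
                resp.raise_for_status()
                return resp.json()["choices"][0]["message"]["content"]
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)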
Test your provider:
from pipeline.core.llm_providers import chat_completion
response = chat_completion(
model="my-model-v1",
messages=[{"role": "user", "content": "Hello!"}]
)For advanced agent behavior (custom prompting, tool use, RAG), extend UnifiedAgent:
- Create a custom agent in pipeline/core/my_custom_agent.py
- Override methods:
  - decide_action(): Custom decision logic
  - _build_llm_context(): Custom prompt construction
- Add pre/post-processing (tool calls, retrieval, etc.)
- Modify the experiment runner to use your agent class
- Test at small scale before running full experiments
Key extension points:
- decide_action(): Control when to ask vs. recommend
- _build_llm_context(): Customize product/dialog presentation
- _llm_decide_action(): Override core LLM prompting
- Add external knowledge, tools, or multi-step reasoning
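A minimal sketch of such a subclass, assuming UnifiedAgent exposes the methods listed above (the import path, signatures, and return format are assumptions, not the exact framework API):

# Illustrative UnifiedAgent subclass; details are assumptions.
from pipeline.core.unified_agent import UnifiedAgent  # assumed module path

class RetrievalAugmentedAgent(UnifiedAgent):
    def decide_action(self, dialog_history, products):
        # Example policy: always ask clarifying questions for the first three
        # turns, then defer to the default LLM-driven decision.
        if len(dialog_history) < 3:
            return {"action": "ask_question"}
        return super().decide_action(dialog_history, products)

    def _build_llm_context(self, dialog_history, products):
        # Prepend externally retrieved notes (e.g. from a RAG store) to the
        # default prompt context.
        notes = self._retrieve_notes(dialog_history)
        base_context = super()._build_llm_context(dialog_history, products)
        return notes + "\n\n" + base_context

    def _retrieve_notes(self, dialog_history):
        # Placeholder retrieval step; swap in your own knowledge source.
        return "Relevant past feedback: (none yet)"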
@article{yang2025bela,
  title={Benchmarking In-context Experiential Reasoning Through Repeated Product Recommendations},
  author={Yang, Gilbert and Chen, Yaqin and Yen, Thomson and Namkoong, Hongseok},
  year={2025}
}

MIT License - see LICENSE for details.