feat: Sonnet 3.7 eval #7

Open · wants to merge 1 commit into main
196 changes: 196 additions & 0 deletions putnam-evaluation-sonnet-3-7/README.md
# Evaluating Advanced Mathematical Reasoning with Claude 3.7 Sonnet and Extended Thinking

This tutorial demonstrates how to evaluate Claude 3.7 Sonnet's mathematical reasoning capabilities on the challenging Putnam 2023 competition problems using Anthropic's extended thinking feature and HoneyHive for evaluation tracking.

## Overview

The William Lowell Putnam Mathematical Competition is the preeminent mathematics competition for undergraduate college students in North America, known for its exceptionally challenging problems that test deep mathematical thinking and rigorous proof writing abilities.

This evaluation leverages Claude 3.7 Sonnet's extended thinking capabilities, which allow the model to show its step-by-step reasoning process before delivering a final answer. This is particularly valuable for complex mathematical problems where the reasoning path is as important as the final solution.

## Key Features

- **Extended Thinking**: Uses Claude 3.7 Sonnet's thinking tokens to capture the model's internal reasoning process
- **Dual Evaluation**: Assesses both the final solution quality and the thinking process quality
- **Comprehensive Metrics**: Tracks performance across different types of mathematical problems
- **HoneyHive Integration**: Stores and visualizes evaluation results for analysis

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Setup](#setup)
3. [Configuration](#configuration)
4. [Running the Evaluation](#running-the-evaluation)
5. [Understanding the Results](#understanding-the-results)
6. [Advanced Usage](#advanced-usage)
7. [Troubleshooting](#troubleshooting)

## Prerequisites

Before you begin, make sure you have:

- **Python 3.10+** installed
- An **Anthropic API key** with access to Claude 3.7 Sonnet
- A **HoneyHive API key**, along with your **HoneyHive project name** and **dataset ID**
- The Putnam 2023 questions and solutions in the provided JSONL file (an example record is shown below)
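
Each line of the file is a JSON object; the fields read by the scripts in this folder are `question_id`, `question_category`, `question`, and `solution` (the values below are placeholders, not an actual Putnam problem):

```json
{"question_id": "A1", "question_category": "Analysis", "question": "<full problem statement>", "solution": "<reference solution>"}
```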

## Setup

1. **Clone the repository** (if you haven't already):
```bash
git clone https://github.com/honeyhiveai/cookbook
cd putnam-evaluation-sonnet-3-7
```

2. **Create and activate a virtual environment**:
```bash
# Create a virtual environment
python -m venv putnam_eval_env

# On macOS/Linux:
source putnam_eval_env/bin/activate

# On Windows:
putnam_eval_env\Scripts\activate
```

3. **Install required packages**:
```bash
pip install -r requirements.txt
```
The requirements file is expected to pull in the `anthropic` SDK and the HoneyHive client that the scripts in this folder rely on.

## Configuration

Open the `putnam_eval.py` script and update the following:

### Update API Keys

Replace the placeholder API keys with your actual keys:

```python
# Replace with your actual Anthropic API key
ANTHROPIC_API_KEY = 'YOUR_ANTHROPIC_API_KEY'
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# In the main execution block, update HoneyHive credentials
evaluate(
    function=putnam_qa,
    hh_api_key='YOUR_HONEYHIVE_API_KEY',
    hh_project='YOUR_HONEYHIVE_PROJECT_NAME',
    name='Putnam Q&A Eval with Claude 3.7 Sonnet Thinking',
    dataset_id='YOUR_HONEYHIVE_DATASET_ID',
    evaluators=[response_quality_evaluator, thinking_process_evaluator]
)
```

### Adjust Thinking Budget (Optional)

You can modify the thinking token budget based on your needs:

```python
completion = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20000,
    thinking={
        "type": "enabled",
        "budget_tokens": 16000  # Adjust this value as needed
    },
    messages=[
        {"role": "user", "content": question}
    ]
)
```
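
With extended thinking enabled, the response contains separate `thinking` and `text` content blocks. The snippet below mirrors the extraction logic in `batch_eval.py` from this folder and shows how the two are separated for dual evaluation:

```python
# Separate the model's internal reasoning from its final answer
thinking_content = ""
final_response = ""

for content_block in completion.content:
    if content_block.type == "thinking":
        thinking_content += content_block.thinking
    elif content_block.type == "text":
        final_response += content_block.text
```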

## Running the Evaluation

1. **Prepare your dataset**:
- The included `putnam_2023.jsonl` file contains the Putnam 2023 competition problems
- Upload this dataset to HoneyHive following their [dataset import guide](https://docs.honeyhive.ai/datasets/import)

2. **Execute the evaluation script**:
```bash
python putnam_eval.py
```

3. **Monitor progress**:
- The script will process each problem in the dataset
- Progress will be displayed in the terminal
- Results will be pushed to HoneyHive for visualization

## Understanding the Results

The evaluation produces two key metrics for each problem:

1. **Solution Quality Score (0-10)**:
- Assesses the correctness, completeness, and elegance of the final solution
- Based on the strict grading criteria of the Putnam Competition

2. **Thinking Process Score (0-10)**:
- Evaluates the quality of the model's reasoning approach
- Considers problem decomposition, technique selection, and logical progression

In HoneyHive, you can:
- Compare performance across different problem types
- Analyze where the model excels or struggles
- Identify patterns in reasoning approaches
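
As an illustration, a minimal LLM-as-judge scorer along these lines could send the grading prompt back to Claude and parse a numeric score out of the reply. This is only a sketch: the actual evaluator implementations and their HoneyHive signatures live in `putnam_eval.py`, and the `[[score]]` convention here is an assumption to adapt to your own grading prompt.

```python
import re

def score_with_judge(grading_prompt: str) -> float:
    """Ask Claude to grade a solution and parse a 0-10 score from its reply.

    Assumes the grading prompt instructs the judge to end with a line such as
    'Rating: [[7]]'; adjust the regex to whatever format your prompt requests.
    """
    judge_reply = anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",  # judge model; illustrative choice
        max_tokens=1000,
        messages=[{"role": "user", "content": grading_prompt}],
    ).content[0].text

    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_reply)
    return float(match.group(1)) if match else 0.0
```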

## Advanced Usage

### Adjusting Evaluation Criteria

You can modify the evaluation prompts in both evaluator functions to focus on specific aspects of mathematical reasoning:

```python
# In response_quality_evaluator
grading_prompt = f"""
[Instruction]
Please act as an impartial judge and evaluate...
"""

# In thinking_process_evaluator
thinking_evaluation_prompt = f"""
[Instruction]
Please evaluate the quality of the AI assistant's thinking process...
"""
```

### Streaming Responses

For real-time monitoring of the model's thinking process, you can implement streaming:

```python
with anthropic_client.messages.stream(
model="claude-3-7-sonnet-20250219",
max_tokens=20000,
thinking={
"type": "enabled",
"budget_tokens": 16000
},
messages=[{"role": "user", "content": question}]
) as stream:
for event in stream:
# Process streaming events
pass
```

## Troubleshooting

### Common Issues

1. **API Key Errors**:
- Ensure your Anthropic API key is valid and has access to Claude 3.7 Sonnet
- Check that environment variables are properly set

2. **Timeout Errors**:
- Complex problems may require longer processing time
- Consider implementing retry logic for long-running requests (a simple wrapper is sketched after this list)

3. **Memory Issues**:
- Processing thinking content for all problems may require significant memory
- Consider batching evaluations for large datasets, for example with the included `batch_eval.py` script
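
A minimal retry wrapper along these lines can make long-running requests more robust (the attempt count and backoff values are arbitrary defaults, not values used by the scripts):

```python
import time

def call_with_retries(make_request, max_attempts=3, backoff_seconds=10):
    """Retry a callable with a simple linear backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {backoff_seconds * attempt}s")
            time.sleep(backoff_seconds * attempt)
```

For example, wrap the `anthropic_client.messages.create(...)` call from the Configuration section in a lambda and pass it to this helper.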

### Getting Help

If you encounter issues:
- Check the [Anthropic API documentation](https://docs.anthropic.com/claude/reference/getting-started-with-the-api)
- Visit the [HoneyHive documentation](https://docs.honeyhive.ai/)
172 changes: 172 additions & 0 deletions putnam-evaluation-sonnet-3-7/batch_eval.py
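"""Batch-evaluate Putnam 2023 problems with Claude 3.7 Sonnet extended thinking.

Example invocation (all flags are defined in main() below):
    python batch_eval.py --input putnam_2023.jsonl --output results --problems A1 B2 --workers 3
"""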
import os
import json
import time
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed
from anthropic import Anthropic

# Replace with your actual Anthropic API key
ANTHROPIC_API_KEY = 'YOUR_ANTHROPIC_API_KEY'
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# Initialize the Anthropic client
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)

def load_problems(file_path, problem_ids=None):
    """
    Load problems from the JSONL file.
    If problem_ids is provided, only load those specific problems.
    """
    problems = []
    with open(file_path, 'r') as f:
        for line in f:
            problem = json.loads(line)
            if problem_ids is None or problem.get('question_id') in problem_ids:
                problems.append(problem)
    return problems

def solve_problem(problem, thinking_budget=16000):
    """Solve a single Putnam problem using Claude 3.7 Sonnet with thinking enabled."""
    print(f"Processing problem {problem['question_id']}: {problem['question_category']}")

    try:
        # Create the completion with thinking enabled
        completion = anthropic_client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=20000,
            thinking={
                "type": "enabled",
                "budget_tokens": thinking_budget
            },
            messages=[
                {"role": "user", "content": problem['question']}
            ]
        )

        # Extract the thinking content and final response
        thinking_content = ""
        final_response = ""

        for content_block in completion.content:
            if content_block.type == "thinking":
                thinking_content += content_block.thinking
            elif content_block.type == "text":
                final_response += content_block.text

        # Return the results
        return {
            "problem_id": problem['question_id'],
            "category": problem['question_category'],
            "question": problem['question'],
            "thinking": thinking_content,
            "solution": final_response,
            "ground_truth": problem['solution'],
            "status": "success"
        }

    except Exception as e:
        print(f"Error processing problem {problem['question_id']}: {str(e)}")
        return {
            "problem_id": problem['question_id'],
            "category": problem['question_category'],
            "question": problem['question'],
            "thinking": "",
            "solution": "",
            "ground_truth": problem['solution'],
            "status": "error",
            "error": str(e)
        }

def batch_evaluate(problems, output_dir="results", max_workers=3, thinking_budget=16000):
    """
    Evaluate multiple problems in parallel using a thread pool.

    Args:
        problems: List of problem dictionaries to evaluate
        output_dir: Directory to save results
        max_workers: Maximum number of concurrent workers
        thinking_budget: Number of tokens to allocate for thinking
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    results = []
    start_time = time.time()

    # Process problems in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all problems to the executor
        future_to_problem = {
            executor.submit(solve_problem, problem, thinking_budget): problem
            for problem in problems
        }

        # Process results as they complete
        for i, future in enumerate(as_completed(future_to_problem)):
            problem = future_to_problem[future]
            try:
                result = future.result()
                results.append(result)

                # Save individual result
                with open(f"{output_dir}/result_{result['problem_id']}.json", 'w') as f:
                    json.dump(result, f, indent=2)

                print(f"Completed {i+1}/{len(problems)}: Problem {result['problem_id']}")

            except Exception as e:
                print(f"Error processing problem {problem['question_id']}: {str(e)}")
                results.append({
                    "problem_id": problem['question_id'],
                    "status": "error",
                    "error": str(e)
                })

    # Calculate total time
    total_time = time.time() - start_time

    # Save all results to a single file
    with open(f"{output_dir}/all_results.json", 'w') as f:
        json.dump({
            "results": results,
            "total_time": total_time,
            "problems_count": len(problems),
            "success_count": sum(1 for r in results if r.get("status") == "success"),
            "error_count": sum(1 for r in results if r.get("status") == "error"),
        }, f, indent=2)

    print(f"\nEvaluation completed in {total_time:.2f} seconds")
    print(f"Results saved to {output_dir}/all_results.json")

    return results

def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description="Batch evaluate Putnam problems using Claude 3.7 Sonnet with thinking")
    parser.add_argument("--input", default="putnam_2023.jsonl", help="Input JSONL file with problems")
    parser.add_argument("--output", default="results", help="Output directory for results")
    parser.add_argument("--problems", nargs="+", help="Specific problem IDs to evaluate (e.g., A1 B2)")
    parser.add_argument("--workers", type=int, default=3, help="Maximum number of concurrent workers")
    parser.add_argument("--thinking-budget", type=int, default=16000, help="Token budget for thinking")

    args = parser.parse_args()

    # Load problems
    problems = load_problems(args.input, args.problems)

    if not problems:
        print("No problems found!")
        return

    print(f"Loaded {len(problems)} problems for evaluation")

    # Run batch evaluation
    batch_evaluate(
        problems,
        output_dir=args.output,
        max_workers=args.workers,
        thinking_budget=args.thinking_budget
    )

if __name__ == "__main__":
    main()