feat: Sonnet 3.7 eval #7

Open · wants to merge 1 commit into main
196 changes: 196 additions & 0 deletions putnam-evaluation-sonnet-3-7/README.md
# Evaluating Advanced Mathematical Reasoning with Claude 3.7 Sonnet and Extended Thinking

This tutorial demonstrates how to evaluate Claude 3.7 Sonnet's mathematical reasoning capabilities on the challenging Putnam 2023 competition problems using Anthropic's extended thinking feature and HoneyHive for evaluation tracking.

## Overview

The William Lowell Putnam Mathematical Competition is the preeminent mathematics competition for undergraduate college students in North America, known for its exceptionally challenging problems that test deep mathematical thinking and rigorous proof writing abilities.

This evaluation leverages Claude 3.7 Sonnet's extended thinking capabilities, which allow the model to show its step-by-step reasoning process before delivering a final answer. This is particularly valuable for complex mathematical problems where the reasoning path is as important as the final solution.

## Key Features

- **Extended Thinking**: Uses Claude 3.7 Sonnet's thinking tokens to capture the model's internal reasoning process
- **Dual Evaluation**: Assesses both the final solution quality and the thinking process quality
- **Comprehensive Metrics**: Tracks performance across different types of mathematical problems
- **HoneyHive Integration**: Stores and visualizes evaluation results for analysis

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Setup](#setup)
3. [Configuration](#configuration)
4. [Running the Evaluation](#running-the-evaluation)
5. [Understanding the Results](#understanding-the-results)
6. [Advanced Usage](#advanced-usage)
7. [Troubleshooting](#troubleshooting)

## Prerequisites

Before you begin, make sure you have:

- **Python 3.10+** installed
- An **Anthropic API key** with access to Claude 3.7 Sonnet
- A **HoneyHive API key**, along with your **HoneyHive project name** and **dataset ID**
- The Putnam 2023 questions and solutions in the provided JSONL file (an example record is shown below)
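
Each line of the file is a JSON object; the fields read by the scripts in this folder are `question_id`, `question_category`, `question`, and `solution` (the values below are placeholders, not an actual Putnam problem):

```json
{"question_id": "A1", "question_category": "Analysis", "question": "<full problem statement>", "solution": "<reference solution>"}
```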

## Setup

1. **Clone the repository** (if you haven't already):
```bash
git clone https://github.com/honeyhiveai/cookbook
cd putnam-evaluation-sonnet-3-7
```

2. **Create and activate a virtual environment**:
```bash
# Create a virtual environment
python -m venv putnam_eval_env

# On macOS/Linux:
source putnam_eval_env/bin/activate

# On Windows:
putnam_eval_env\Scripts\activate
```

3. **Install required packages**:
```bash
pip install -r requirements.txt
```
The requirements file is expected to pull in the `anthropic` SDK and the HoneyHive client that the scripts in this folder rely on.

## Configuration

Open the `putnam_eval.py` script and update the following:

### Update API Keys

Replace the placeholder API keys with your actual keys:

```python
# Replace with your actual Anthropic API key
ANTHROPIC_API_KEY = 'YOUR_ANTHROPIC_API_KEY'
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# In the main execution block, update HoneyHive credentials
evaluate(
    function=putnam_qa,
    hh_api_key='YOUR_HONEYHIVE_API_KEY',
    hh_project='YOUR_HONEYHIVE_PROJECT_NAME',
    name='Putnam Q&A Eval with Claude 3.7 Sonnet Thinking',
    dataset_id='YOUR_HONEYHIVE_DATASET_ID',
    evaluators=[response_quality_evaluator, thinking_process_evaluator]
)
```

### Adjust Thinking Budget (Optional)

You can modify the thinking token budget based on your needs:

```python
completion = anthropic_client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20000,
    thinking={
        "type": "enabled",
        "budget_tokens": 16000  # Adjust this value as needed
    },
    messages=[
        {"role": "user", "content": question}
    ]
)
```
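
With extended thinking enabled, the response contains separate `thinking` and `text` content blocks. The snippet below mirrors the extraction logic in `batch_eval.py` from this folder and shows how the two are separated for dual evaluation:

```python
# Separate the model's internal reasoning from its final answer
thinking_content = ""
final_response = ""

for content_block in completion.content:
    if content_block.type == "thinking":
        thinking_content += content_block.thinking
    elif content_block.type == "text":
        final_response += content_block.text
```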

## Running the Evaluation

1. **Prepare your dataset**:
- The included `putnam_2023.jsonl` file contains the Putnam 2023 competition problems
- Upload this dataset to HoneyHive following their [dataset import guide](https://docs.honeyhive.ai/datasets/import)

2. **Execute the evaluation script**:
```bash
python putnam_eval.py
```

3. **Monitor progress**:
- The script will process each problem in the dataset
- Progress will be displayed in the terminal
- Results will be pushed to HoneyHive for visualization

## Understanding the Results

The evaluation produces two key metrics for each problem:

1. **Solution Quality Score (0-10)**:
- Assesses the correctness, completeness, and elegance of the final solution
- Based on the strict grading criteria of the Putnam Competition

2. **Thinking Process Score (0-10)**:
- Evaluates the quality of the model's reasoning approach
- Considers problem decomposition, technique selection, and logical progression

In HoneyHive, you can:
- Compare performance across different problem types
- Analyze where the model excels or struggles
- Identify patterns in reasoning approaches
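
As an illustration, a minimal LLM-as-judge scorer along these lines could send the grading prompt back to Claude and parse a numeric score out of the reply. This is only a sketch: the actual evaluator implementations and their HoneyHive signatures live in `putnam_eval.py`, and the `[[score]]` convention here is an assumption to adapt to your own grading prompt.

```python
import re

def score_with_judge(grading_prompt: str) -> float:
    """Ask Claude to grade a solution and parse a 0-10 score from its reply.

    Assumes the grading prompt instructs the judge to end with a line such as
    'Rating: [[7]]'; adjust the regex to whatever format your prompt requests.
    """
    judge_reply = anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",  # judge model; illustrative choice
        max_tokens=1000,
        messages=[{"role": "user", "content": grading_prompt}],
    ).content[0].text

    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_reply)
    return float(match.group(1)) if match else 0.0
```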

## Advanced Usage

### Adjusting Evaluation Criteria

You can modify the evaluation prompts in both evaluator functions to focus on specific aspects of mathematical reasoning:

```python
# In response_quality_evaluator
grading_prompt = f"""
[Instruction]
Please act as an impartial judge and evaluate...
"""

# In thinking_process_evaluator
thinking_evaluation_prompt = f"""
[Instruction]
Please evaluate the quality of the AI assistant's thinking process...
"""
```

### Streaming Responses

For real-time monitoring of the model's thinking process, you can implement streaming:

```python
with anthropic_client.messages.stream(
model="claude-3-7-sonnet-20250219",
max_tokens=20000,
thinking={
"type": "enabled",
"budget_tokens": 16000
},
messages=[{"role": "user", "content": question}]
) as stream:
for event in stream:
# Process streaming events
pass
```

## Troubleshooting

### Common Issues

1. **API Key Errors**:
- Ensure your Anthropic API key is valid and has access to Claude 3.7 Sonnet
- Check that environment variables are properly set

2. **Timeout Errors**:
- Complex problems may require longer processing time
- Consider implementing retry logic for long-running requests (a simple wrapper is sketched after this list)

3. **Memory Issues**:
- Processing thinking content for all problems may require significant memory
- Consider batching evaluations for large datasets, for example with the included `batch_eval.py` script
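
A minimal retry wrapper along these lines can make long-running requests more robust (the attempt count and backoff values are arbitrary defaults, not values used by the scripts):

```python
import time

def call_with_retries(make_request, max_attempts=3, backoff_seconds=10):
    """Retry a callable with a simple linear backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {backoff_seconds * attempt}s")
            time.sleep(backoff_seconds * attempt)
```

For example, wrap the `anthropic_client.messages.create(...)` call from the Configuration section in a lambda and pass it to this helper.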

### Getting Help

If you encounter issues:
- Check the [Anthropic API documentation](https://docs.anthropic.com/claude/reference/getting-started-with-the-api)
- Visit the [HoneyHive documentation](https://docs.honeyhive.ai/)
172 changes: 172 additions & 0 deletions putnam-evaluation-sonnet-3-7/batch_eval.py
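"""Batch-evaluate Putnam 2023 problems with Claude 3.7 Sonnet extended thinking.

Example invocation (all flags are defined in main() below):
    python batch_eval.py --input putnam_2023.jsonl --output results --problems A1 B2 --workers 3
"""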
import os
import json
import time
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed
from anthropic import Anthropic

# Replace with your actual Anthropic API key
ANTHROPIC_API_KEY = 'YOUR_ANTHROPIC_API_KEY'
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_API_KEY

# Initialize the Anthropic client
anthropic_client = Anthropic(api_key=ANTHROPIC_API_KEY)

def load_problems(file_path, problem_ids=None):
    """
    Load problems from the JSONL file.
    If problem_ids is provided, only load those specific problems.
    """
    problems = []
    with open(file_path, 'r') as f:
        for line in f:
            problem = json.loads(line)
            if problem_ids is None or problem.get('question_id') in problem_ids:
                problems.append(problem)
    return problems

def solve_problem(problem, thinking_budget=16000):
    """Solve a single Putnam problem using Claude 3.7 Sonnet with thinking enabled."""
    print(f"Processing problem {problem['question_id']}: {problem['question_category']}")

    try:
        # Create the completion with thinking enabled
        completion = anthropic_client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=20000,
            thinking={
                "type": "enabled",
                "budget_tokens": thinking_budget
            },
            messages=[
                {"role": "user", "content": problem['question']}
            ]
        )

        # Extract the thinking content and final response
        thinking_content = ""
        final_response = ""

        for content_block in completion.content:
            if content_block.type == "thinking":
                thinking_content += content_block.thinking
            elif content_block.type == "text":
                final_response += content_block.text

        # Return the results
        return {
            "problem_id": problem['question_id'],
            "category": problem['question_category'],
            "question": problem['question'],
            "thinking": thinking_content,
            "solution": final_response,
            "ground_truth": problem['solution'],
            "status": "success"
        }

    except Exception as e:
        print(f"Error processing problem {problem['question_id']}: {str(e)}")
        return {
            "problem_id": problem['question_id'],
            "category": problem['question_category'],
            "question": problem['question'],
            "thinking": "",
            "solution": "",
            "ground_truth": problem['solution'],
            "status": "error",
            "error": str(e)
        }

def batch_evaluate(problems, output_dir="results", max_workers=3, thinking_budget=16000):
    """
    Evaluate multiple problems in parallel using a thread pool.

    Args:
        problems: List of problem dictionaries to evaluate
        output_dir: Directory to save results
        max_workers: Maximum number of concurrent workers
        thinking_budget: Number of tokens to allocate for thinking
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    results = []
    start_time = time.time()

    # Process problems in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all problems to the executor
        future_to_problem = {
            executor.submit(solve_problem, problem, thinking_budget): problem
            for problem in problems
        }

        # Process results as they complete
        for i, future in enumerate(as_completed(future_to_problem)):
            problem = future_to_problem[future]
            try:
                result = future.result()
                results.append(result)

                # Save individual result
                with open(f"{output_dir}/result_{result['problem_id']}.json", 'w') as f:
                    json.dump(result, f, indent=2)

                print(f"Completed {i+1}/{len(problems)}: Problem {result['problem_id']}")

            except Exception as e:
                print(f"Error processing problem {problem['question_id']}: {str(e)}")
                results.append({
                    "problem_id": problem['question_id'],
                    "status": "error",
                    "error": str(e)
                })

    # Calculate total time
    total_time = time.time() - start_time

    # Save all results to a single file
    with open(f"{output_dir}/all_results.json", 'w') as f:
        json.dump({
            "results": results,
            "total_time": total_time,
            "problems_count": len(problems),
            "success_count": sum(1 for r in results if r.get("status") == "success"),
            "error_count": sum(1 for r in results if r.get("status") == "error"),
        }, f, indent=2)

    print(f"\nEvaluation completed in {total_time:.2f} seconds")
    print(f"Results saved to {output_dir}/all_results.json")

    return results

def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description="Batch evaluate Putnam problems using Claude 3.7 Sonnet with thinking")
    parser.add_argument("--input", default="putnam_2023.jsonl", help="Input JSONL file with problems")
    parser.add_argument("--output", default="results", help="Output directory for results")
    parser.add_argument("--problems", nargs="+", help="Specific problem IDs to evaluate (e.g., A1 B2)")
    parser.add_argument("--workers", type=int, default=3, help="Maximum number of concurrent workers")
    parser.add_argument("--thinking-budget", type=int, default=16000, help="Token budget for thinking")

    args = parser.parse_args()

    # Load problems
    problems = load_problems(args.input, args.problems)

    if not problems:
        print("No problems found!")
        return

    print(f"Loaded {len(problems)} problems for evaluation")

    # Run batch evaluation
    batch_evaluate(
        problems,
        output_dir=args.output,
        max_workers=args.workers,
        thinking_budget=args.thinking_budget
    )

if __name__ == "__main__":
    main()