This project evaluates the ability of large language models (LLMs) to solve high school-level math problems. It focuses on reasoning accuracy and confidence calibration using both single-step (baseline) and step-by-step (Chain-of-Thought, CoT) prompting.
For each model response, the system extracts structured outputs, including the predicted answer and multiple confidence estimates. These values quantify how confident the model is in its answers and are binned into reliability tables to evaluate calibration, i.e., whether higher confidence corresponds to higher accuracy.
.
├── data/ # Input datasets (original and reformatted)
│ ├── original_dataset.json
│ └── reformatted_augmented_dataset.json
│
├── outputs_baseline/ # Results from baseline evaluation
│ ├── prediction_with_uncertainties.json
│ ├── confidence_accuracy_self_eval.json
│ ├── confidence_accuracy_logit.json
│ └── confidence_accuracy_internal.json
│
├── outputs_cot/ # Results from Chain-of-Thought evaluation
│ ├── prediction_with_uncertainties_cot.json
│ ├── confidence_accuracy_self_eval_cot.json
│ ├── confidence_accuracy_logit_cot.json
│ └── confidence_accuracy_internal_cot.json
│
├── scripts/
│ ├── baseline/
│ │ ├── evaluate_original.py # Baseline evaluation script
│ │ └── utils.py # Baseline utils (prompting, parsing, evaluation)
│ │
│ ├── cot/
│ │ ├── evaluate_original_cot.py # CoT evaluation script
│ │ └── utils.py # CoT-specific prompt and evaluation utilities
│ │
│ ├── const.py # Shared constants and configuration
│ ├── keys.py # Contains your PREMIUM_API_KEY
To get started, follow these steps:
- Install dependencies
  Make sure you have Python 3.8+ and run:
  pip install openai numpy
- Add your API key
  Create a file at scripts/keys.py containing:
  PREMIUM_API_KEY = "sk-..."  # Replace with your actual API key
- Check configuration
  Open scripts/const.py and verify or update the following paths and model name:
  MODEL_GPT_3_5 = "gpt-3.5-turbo"
  INPUT_PATH_ORIGINAL = "data/original_dataset.json"
  INPUT_PATH_AUGMENTED = "data/reformatted_augmented_dataset.json"
  OUTPUT_DIR_BASELINE = "outputs_baseline"
  OUTPUT_DIR_COT = "outputs_cot"
- (Optional) Limit the number of samples
  To test on a subset of the data, set NUM_SAMPLES = <integer> at the top of:
  scripts/baseline/evaluate_original.py
  scripts/cot/evaluate_original_cot.py
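How NUM_SAMPLES is applied is defined by the scripts themselves; the snippet below is only a sketch of the typical pattern, assuming the dataset loads as a JSON list.

```python
# Sketch only: how NUM_SAMPLES might limit the evaluated subset.
import json

NUM_SAMPLES = 25  # evaluate only the first 25 questions; use None for the full dataset

with open("data/original_dataset.json") as f:
    questions = json.load(f)

if NUM_SAMPLES is not None:
    questions = questions[:NUM_SAMPLES]
```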
To evaluate the original dataset using a single-step baseline approach:
python scripts/baseline/evaluate_original.py
This script will:
- Load questions from data/original_dataset.json
- Generate standard prompts using scripts/baseline/utils.py
- Query the GPT model defined in scripts/const.py (default: gpt-3.5-turbo)
- Parse structured outputs including:
  - Predicted final answer
  - Final confidence metrics:
    - self_eval_confidence
    - logit_based_confidence
    - internal_based_confidence
- Save results to outputs_baseline/prediction_with_uncertainties.json
- Create confidence-accuracy reliability tables:
  - confidence_accuracy_self_eval.json
  - confidence_accuracy_logit.json
  - confidence_accuracy_internal.json
To test with fewer examples, set NUM_SAMPLES to an integer at the top of evaluate_original.py. An example prediction entry looks like this:
{
"id": "xyz-789",
"question": "If x + 2 = 5, what is x?",
"expected_answer": ["C"],
"model_response": {
"predicted_answer": "C",
"confidence": {
"self_eval_confidence": 0.93,
"logit_based_confidence": 0.91,
"internal_based_confidence": 0.89
}
},
"raw_text": "...",
"timestamp": 1717019999.0
}
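For a quick sanity check of a finished run, the sketch below loads the predictions file and reports overall accuracy and mean self-evaluated confidence. It assumes the field names shown in the example entry above and that the output file is a JSON list of such entries; the authoritative structure is whatever scripts/baseline/evaluate_original.py writes.

```python
# Sketch: summarize baseline predictions (field names follow the example entry above).
import json

with open("outputs_baseline/prediction_with_uncertainties.json") as f:
    predictions = json.load(f)

correct = sum(
    1 for p in predictions
    if p["model_response"]["predicted_answer"] in p["expected_answer"]
)
mean_conf = sum(
    p["model_response"]["confidence"]["self_eval_confidence"] for p in predictions
) / len(predictions)

print(f"Accuracy: {correct / len(predictions):.3f}")
print(f"Mean self-eval confidence: {mean_conf:.3f}")
```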
To evaluate mathematical reasoning using a step-by-step Chain of Thought strategy:
python scripts/cot/evaluate_original_cot.py
This script will:
- Load questions from data/original_dataset.json
- Generate CoT-formatted prompts using scripts/cot/utils.py
- Query the GPT model defined in scripts/const.py (default: gpt-3.5-turbo)
- Parse structured outputs including:
  - Predicted final answer
  - Intermediate reasoning steps
  - Confidence metrics at each step:
    - self_eval_confidence
    - logit_based_confidence
    - internal_based_confidence
- Save results to outputs_cot/prediction_with_uncertainties_cot.json
- Create confidence-accuracy reliability tables:
  - confidence_accuracy_self_eval_cot.json
  - confidence_accuracy_logit_cot.json
  - confidence_accuracy_internal_cot.json
To test with fewer examples, set NUM_SAMPLES to an integer at the top of evaluate_original_cot.py. An example prediction entry looks like this:
{
"id": "abc-123",
"question": "What is the value of 2x when x = 3?",
"expected_answer": ["B"],
"model_response": {
"predicted_answer": "B",
"intermediate_confidences": [
{
"step_number": 1,
"self_eval_confidence": 0.85,
"logit_based_confidence": 0.80,
"internal_based_confidence": 0.75
},
{
"step_number": 2,
"self_eval_confidence": 0.90,
"logit_based_confidence": 0.85,
"internal_based_confidence": 0.80
}
],
"final_confidence": {
"self_eval_confidence": 0.95,
"logit_based_confidence": 0.90,
"internal_based_confidence": 0.92
}
},
"raw_text": "...",
"timestamp": 1717012345.0
}
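The per-step confidences make it possible to check whether the model's certainty rises or falls as its reasoning unfolds. The sketch below prints the self-evaluation confidence trajectory for each entry, assuming the field names from the example above.

```python
# Sketch: print each prediction's self-eval confidence trajectory across CoT steps.
import json

with open("outputs_cot/prediction_with_uncertainties_cot.json") as f:
    predictions = json.load(f)

for p in predictions:
    steps = sorted(
        p["model_response"]["intermediate_confidences"],
        key=lambda s: s["step_number"],
    )
    trajectory = [s["self_eval_confidence"] for s in steps]
    final = p["model_response"]["final_confidence"]["self_eval_confidence"]
    print(f"{p['id']}: steps={trajectory} final={final}")
```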
Each prediction includes:

Confidence Field | Description |
---|---|
self_eval_confidence | Model's subjective estimate of correctness |
logit_based_confidence | Probability assigned to the predicted answer |
internal_based_confidence | Confidence based on internal consistency (self-declared) |
These are used to generate confidence-accuracy reliability tables.
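As an illustration of where a logit-based score can come from, the sketch below averages the token log-probabilities returned by the OpenAI Chat Completions API and exponentiates the result (a geometric-mean token probability). This is only one possible definition; the computation actually used by the project lives in scripts/baseline/utils.py and scripts/cot/utils.py and may differ.

```python
# Sketch: one way to derive a logit-based confidence from token log-probabilities.
import math
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # e.g., PREMIUM_API_KEY from scripts/keys.py
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "If x + 2 = 5, what is x? Answer with one letter."}],
    logprobs=True,
    max_tokens=5,
)

# Geometric-mean probability of the generated tokens.
logprobs = [tok.logprob for tok in resp.choices[0].logprobs.content]
logit_based_confidence = math.exp(sum(logprobs) / len(logprobs))
print(f"logit_based_confidence ~ {logit_based_confidence:.3f}")
```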
After each evaluation, the script generates reliability tables that show the accuracy of model predictions within different confidence intervals. These tables are saved as JSON files in the corresponding output directory (outputs_baseline/ or outputs_cot/).
Each entry in the table represents:
- A confidence bin (e.g., 0.6–0.7)
- The number of predictions that fell within that bin
- The proportion of those predictions that were correct (accuracy)
[
{
"confidence_bin": "0.0-0.1",
"num_samples": 3,
"accuracy": 0.0
},
{
"confidence_bin": "0.8-0.9",
"num_samples": 45,
"accuracy": 0.844
},
{
"confidence_bin": "0.9-1.0",
"num_samples": 82,
"accuracy": 0.902
}
]
Use these tables to analyze how well model confidence correlates with actual correctness (calibration).
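One simple analysis, sketched below, is an expected-calibration-error style summary computed directly from a reliability table in the format shown above; the bin midpoint is used as a proxy for the mean confidence inside each bin.

```python
# Sketch: weighted calibration gap (ECE-style) from a reliability table.
import json

with open("outputs_baseline/confidence_accuracy_self_eval.json") as f:
    table = json.load(f)

total = sum(row["num_samples"] for row in table)
ece = 0.0
for row in table:
    lo, hi = (float(x) for x in row["confidence_bin"].split("-"))
    midpoint = (lo + hi) / 2  # proxy for the mean confidence in this bin
    weight = row["num_samples"] / total
    ece += weight * abs(row["accuracy"] - midpoint)

print(f"Approximate ECE: {ece:.3f} over {total} predictions")
```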
- Prompts strictly enforce valid JSON-only output.
- Invalid or unparsable responses are logged and skipped during evaluation (see the parsing sketch after this list).
- The "confidence_distribution" field must sum to 1.
- Intermediate results are saved incrementally to prevent data loss on interruption.
- Set NUM_SAMPLES in each evaluation script to quickly test on smaller subsets.
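As noted above, responses that are not valid JSON are logged and skipped. The sketch below shows that kind of defensive parsing in isolation; the actual handling is implemented in the utils modules and may differ.

```python
# Sketch: parse a model response as JSON, logging and skipping anything malformed.
import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def parse_response(raw_text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if the response is unusable."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        logger.warning("Skipping unparsable response: %.80s", raw_text)
        return None
    if "predicted_answer" not in parsed:
        logger.warning("Skipping response without a predicted_answer field")
        return None
    return parsed
```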
Created as part of EEL6935: Safe Autonomous Systems coursework at the University of Florida.
You must have a valid OpenAI API key and/or access to the local server endpoint to run the scripts.
This project is licensed under the MIT License.