This repository contains the modular Python application used to systematically evaluate Large Language Models (LLMs) against CA-Ben (the Chartered Accountancy Benchmark). It automates the entire workflow, from sending prompts to the models to processing their responses and extracting the final answers for analysis. 📊

- Paper: Large Language Models Acing Chartered Accountancy
- Colab: Benchmark Loading and Analysis Script
- Benchmark: 📩 For access to the benchmark data, please email the corresponding author and cc the first author. 📩

- 🧱 Modular Architecture: Clean separation of concerns. Each module handles a specific task (API calls, file I/O, retry logic), making the code easy to maintain and extend.
- 🔄 Robust Retry Logic: Automatically handles API rate limits and transient network errors using an exponential backoff strategy (see the sketch after this list). Never lose a request!
- 🔑 Dynamic Token Management: Seamlessly refreshes API tokens upon encountering rate-limit errors, ensuring uninterrupted long-running evaluation sessions.
- 📂 Batch File Processing: Efficiently processes hundreds of question files organized in subdirectories, perfect for large-scale benchmarks.
- 📝 Structured Excel Output: Saves all questions, model responses, and extracted answers into neatly organized `.xlsx` files for easy analysis and verification.
- ⚙️ Centralized Configuration: Easily manage API keys, model parameters, and file paths from a single `config.py` file.
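The retry and token-refresh behaviour lives in `retry_handler.py`; that module's actual code isn't reproduced in this README, so the following is only a minimal sketch of exponential backoff with token refresh, and the callables `send_request` and `refresh_token` are hypothetical stand-ins:

```python
import random
import time


def call_with_backoff(send_request, refresh_token, max_retries=5, base_delay=2.0):
    """Retry a request with exponential backoff; refresh the token on rate limits.

    `send_request` and `refresh_token` are illustrative callables, not the
    actual API of retry_handler.py.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception as err:  # in practice, catch the client's specific errors
            if "rate limit" in str(err).lower():
                refresh_token()  # swap in a fresh API token before retrying
            if attempt == max_retries - 1:
                raise
            # exponential backoff with jitter: ~2s, 4s, 8s, ... plus noise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

The jitter term spreads out concurrent retries so that many failed requests do not all hit the API again at the same instant.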
Get the evaluation pipeline up and running in three simple steps:

- Install Dependencies 📦

  ```bash
  pip install -r requirements.txt
  ```

- Configure Your API Token 🔑

  Open `src/config.py` and update the `TOKEN` variable with your API key (an illustrative `config.py` sketch follows these steps):

  ```python
  TOKEN = "your_super_secret_api_token_here"
  ```

- Launch the Application ▶️

  Execute the main script from the root directory:

  ```bash
  python src/main.py
  ```

The application will begin processing the files, and you will see the progress in your console.
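For orientation, here is an illustrative sketch of what `src/config.py` might contain. Only `TOKEN` is documented above; the remaining settings are assumptions about the "model parameters and file paths" that the configuration is said to centralize:

```python
# src/config.py — illustrative sketch; only TOKEN is documented in this README,
# the remaining settings are assumed examples of model parameters and file paths.

TOKEN = "your_super_secret_api_token_here"  # API key used by api_client.py

MODEL_NAME = "gpt-4o"           # assumed: which LLM endpoint to evaluate
TEMPERATURE = 0.0               # assumed: deterministic decoding for benchmarking
MAX_RETRIES = 5                 # assumed: cap for retry_handler.py's backoff loop

BASE_FOLDER = "data/questions"  # assumed: folder scanned by file_handler.py
OUTPUT_FOLDER = "results"       # assumed: where the .xlsx files are written
```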
The application is organized logically to ensure clarity and maintainability; an illustrative sketch of the entry point follows the tree.

```
src/
├── 📜 main.py            # Main entry point - kicks off the evaluation process.
├── ⚙️ config.py          # Central hub for all settings (API keys, model params, paths).
├── 🧠 prompts.py         # Stores the system prompts used to guide the AI models.
├── 🌐 api_client.py      # Handles all communication with the AI model APIs.
├── 📄 file_handler.py    # Manages all file operations (reading questions, writing results).
├── ⏳ retry_handler.py   # Implements the smart retry logic for failed API calls.
├── 🛠️ processor.py       # The core orchestrator that manages the entire workflow.
└── 📋 requirements.txt   # A list of all the Python libraries you need.
```
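To show how these modules might fit together, here is a minimal sketch of the entry point; the repository's real `main.py` isn't shown in this README, so the names below (`Processor`, `SYSTEM_PROMPT`, `run`) are assumptions:

```python
# src/main.py — illustrative sketch only; the actual entry point may differ.
import config                    # central settings: TOKEN, model params, paths
import prompts                   # system prompts that guide the AI models
from processor import Processor  # assumed orchestrator class in processor.py


def main():
    # Wire the configuration and prompts into the orchestrator, then run the
    # full pipeline: file discovery -> API calls (with retries) -> Excel output.
    processor = Processor(config=config, system_prompt=prompts.SYSTEM_PROMPT)
    processor.run()


if __name__ == "__main__":
    main()
```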
The application's workflow is orchestrated by `processor.py` and follows these steps:

- Initialization: Loads configuration from `config.py` and system prompts from `prompts.py`.
- File Discovery: `file_handler.py` scans the specified base folder for all question files.
- Processing Loop: For each question:
  - The question content is combined with a system prompt.
  - `api_client.py` sends the request to the target LLM.
  - If the API call fails, `retry_handler.py` takes over, waiting and retrying with exponential backoff.
- Response Handling: The model's raw text response is received.
- Data Storage: The question, the model's full response, and other metadata are saved to an Excel file by `file_handler.py`.
- Answer Extraction: A separate process (as described in our paper) parses the Excel files to extract the final answer choices for accuracy calculation; a hedged sketch of this step follows below.
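The extraction step itself is defined in the paper rather than in this repository, so the snippet below is only a plausible sketch: it assumes each model response states a final choice such as `Answer: (b)`, and the file name and column names (`Response`, `Extracted Answer`) are hypothetical:

```python
# Illustrative answer-extraction sketch, not the paper's actual parser.
# Assumes responses contain a final choice like "Answer: (b)" or "Option C";
# the file name and the "Response" / "Extracted Answer" columns are hypothetical.

import re

import pandas as pd

CHOICE_PATTERN = re.compile(r"(?:answer|option)\s*[:\-]?\s*\(?([a-d])\)?", re.IGNORECASE)


def extract_choice(response: str):
    """Return the last answer letter mentioned in a model response, if any."""
    matches = CHOICE_PATTERN.findall(str(response))
    return matches[-1].lower() if matches else None


df = pd.read_excel("results/gpt4o_foundation_f1.xlsx")  # hypothetical output file
df["Extracted Answer"] = df["Response"].apply(extract_choice)
df.to_excel("results/gpt4o_foundation_f1_extracted.xlsx", index=False)
```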
| Models | F1 | F2 | I1 | I2 | I3 | I4 | I5 | I6 | FI1 | FI2 | FI3 | FI4 | FI5 | FI6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT 4o | 50.00 | 58.00 | 46.66 | 73.33 | 20.00 | 20.00 | 86.66 | 75.00 | 71.43 | 53.33 | 78.57 | 53.33 | 33.33 | 41.67 |
| LLAMA 3.3 70B Int. | 59.00 | 56.00 | 33.33 | 60.00 | 40.00 | 40.00 | 73.33 | 75.00 | 64.29 | 33.33 | 71.43 | 53.33 | 6.67 | 20.83 |
| LLAMA 3.1 405B Int. | 53.00 | 59.00 | 40.00 | 53.33 | 20.00 | 40.00 | 86.66 | 56.25 | 64.29 | 46.67 | 71.43 | 13.33 | 26.67 | 41.67 |
| MISTRAL Large | 41.00 | 56.00 | 41.66 | 53.33 | 31.25 | 20.00 | 73.33 | 60.00 | 42.86 | 41.67 | 57.14 | 46.67 | 13.33 | 29.17 |
| Claude 3.5 Sonnet | 60.00 | 60.00 | 33.33 | 60.00 | 20.00 | 46.66 | 93.33 | 75.00 | 78.57 | 46.67 | 64.29 | 53.33 | 20.00 | 62.50 |
| Microsoft Phi 4 | 56.00 | 62.00 | 46.66 | 46.66 | 33.33 | 33.33 | 66.66 | 68.75 | 64.29 | 53.33 | 57.14 | 26.67 | 6.67 | 41.67 |

Legend: F1 and F2 are Foundation papers, I1 to I6 are Intermediate papers, and FI1 to FI6 are Final papers. All values are accuracy (%).
If you use this work, please cite it as:

```bibtex
@misc{gupta2025,
  title={Large Language Models Acing Chartered Accountancy},
  author={Jatin Gupta and Akhil Sharma and Saransh Singhania and Mohammad Adnan and Sakshi Deo and Ali Imam Abidi and Keshav Gupta},
  year={2025},
  eprint={2506.21031},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.21031},
}
```