This repository contains the modular Python application used to systematically evaluate Large Language Models (LLMs) against CA-Ben (the Chartered Accountancy Benchmark). It automates the entire workflow, from sending prompts to the models to processing their responses and extracting the final answers for analysis. 📊

- Paper: Large Language Models Acing Chartered Accountancy
- Colab: Benchmark Loading and Analysis Script
- Benchmark: 📩 For access to the benchmark data, please email the corresponding author and cc the first author. 📩

- 🧱 Modular Architecture: Clean separation of concerns. Each module handles a specific task (API calls, file I/O, retry logic), making the code easy to maintain and extend.
- 🔄 Robust Retry Logic: Automatically handles API rate limits and transient network errors using an exponential backoff strategy (see the sketch after this list). Never lose a request!
- 🔑 Dynamic Token Management: Seamlessly refreshes API tokens upon encountering rate-limit errors, ensuring uninterrupted long-running evaluation sessions.
- 📂 Batch File Processing: Efficiently processes hundreds of question files organized in subdirectories, perfect for large-scale benchmarks.
- 📝 Structured Excel Output: Saves all questions, model responses, and extracted answers into neatly organized `.xlsx` files for easy analysis and verification.
- ⚙️ Centralized Configuration: Easily manage API keys, model parameters, and file paths from a single `config.py` file.
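The retry and token-refresh behaviour lives in `retry_handler.py`; that module's actual code isn't reproduced in this README, so the following is only a minimal sketch of exponential backoff with token refresh, and the callables `send_request` and `refresh_token` are hypothetical stand-ins:

```python
import random
import time


def call_with_backoff(send_request, refresh_token, max_retries=5, base_delay=2.0):
    """Retry a request with exponential backoff; refresh the token on rate limits.

    `send_request` and `refresh_token` are illustrative callables, not the
    actual API of retry_handler.py.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception as err:  # in practice, catch the client's specific errors
            if "rate limit" in str(err).lower():
                refresh_token()  # swap in a fresh API token before retrying
            if attempt == max_retries - 1:
                raise
            # exponential backoff with jitter: ~2s, 4s, 8s, ... plus noise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

The jitter term spreads out concurrent retries so that many failed requests do not all hit the API again at the same instant.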
Get the evaluation pipeline up and running in three simple steps:

- Install Dependencies 📦

  ```bash
  pip install -r requirements.txt
  ```

- Configure Your API Token 🔑

  Open `src/config.py` and update the `TOKEN` variable with your API key (an illustrative `config.py` sketch follows these steps):

  ```python
  TOKEN = "your_super_secret_api_token_here"
  ```

- Launch the Application ▶️

  Execute the main script from the root directory:

  ```bash
  python src/main.py
  ```

The application will begin processing the files, and you will see the progress in your console.
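For orientation, here is an illustrative sketch of what `src/config.py` might contain. Only `TOKEN` is documented above; the remaining settings are assumptions about the "model parameters and file paths" that the configuration is said to centralize:

```python
# src/config.py — illustrative sketch; only TOKEN is documented in this README,
# the remaining settings are assumed examples of model parameters and file paths.

TOKEN = "your_super_secret_api_token_here"  # API key used by api_client.py

MODEL_NAME = "gpt-4o"           # assumed: which LLM endpoint to evaluate
TEMPERATURE = 0.0               # assumed: deterministic decoding for benchmarking
MAX_RETRIES = 5                 # assumed: cap for retry_handler.py's backoff loop

BASE_FOLDER = "data/questions"  # assumed: folder scanned by file_handler.py
OUTPUT_FOLDER = "results"       # assumed: where the .xlsx files are written
```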
The application is organized logically to ensure clarity and maintainability; an illustrative sketch of the entry point follows the tree.

```
src/
├── 📜 main.py            # Main entry point - kicks off the evaluation process.
├── ⚙️ config.py          # Central hub for all settings (API keys, model params, paths).
├── 🧠 prompts.py         # Stores the system prompts used to guide the AI models.
├── 🌐 api_client.py      # Handles all communication with the AI model APIs.
├── 📄 file_handler.py    # Manages all file operations (reading questions, writing results).
├── ⏳ retry_handler.py   # Implements the smart retry logic for failed API calls.
├── 🛠️ processor.py       # The core orchestrator that manages the entire workflow.
└── 📋 requirements.txt   # A list of all the Python libraries you need.
```
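To show how these modules might fit together, here is a minimal sketch of the entry point; the repository's real `main.py` isn't shown in this README, so the names below (`Processor`, `SYSTEM_PROMPT`, `run`) are assumptions:

```python
# src/main.py — illustrative sketch only; the actual entry point may differ.
import config                    # central settings: TOKEN, model params, paths
import prompts                   # system prompts that guide the AI models
from processor import Processor  # assumed orchestrator class in processor.py


def main():
    # Wire the configuration and prompts into the orchestrator, then run the
    # full pipeline: file discovery -> API calls (with retries) -> Excel output.
    processor = Processor(config=config, system_prompt=prompts.SYSTEM_PROMPT)
    processor.run()


if __name__ == "__main__":
    main()
```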
The application's workflow is orchestrated by `processor.py` and follows these steps:

- Initialization: Loads configuration from `config.py` and system prompts from `prompts.py`.
- File Discovery: `file_handler.py` scans the specified base folder for all question files.
- Processing Loop: For each question:
  - The question content is combined with a system prompt.
  - `api_client.py` sends the request to the target LLM.
  - If the API call fails, `retry_handler.py` takes over, waiting and retrying with exponential backoff.
- Response Handling: The model's raw text response is received.
- Data Storage: The question, the model's full response, and other metadata are saved to an Excel file by `file_handler.py`.
- Answer Extraction: A separate process (as described in our paper) parses the Excel files to extract the final answer choices for accuracy calculation; a hedged sketch of this step follows below.
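The extraction step itself is defined in the paper rather than in this repository, so the snippet below is only a plausible sketch: it assumes each model response states a final choice such as `Answer: (b)`, and the file name and column names (`Response`, `Extracted Answer`) are hypothetical:

```python
# Illustrative answer-extraction sketch, not the paper's actual parser.
# Assumes responses contain a final choice like "Answer: (b)" or "Option C";
# the file name and the "Response" / "Extracted Answer" columns are hypothetical.

import re

import pandas as pd

CHOICE_PATTERN = re.compile(r"(?:answer|option)\s*[:\-]?\s*\(?([a-d])\)?", re.IGNORECASE)


def extract_choice(response: str):
    """Return the last answer letter mentioned in a model response, if any."""
    matches = CHOICE_PATTERN.findall(str(response))
    return matches[-1].lower() if matches else None


df = pd.read_excel("results/gpt4o_foundation_f1.xlsx")  # hypothetical output file
df["Extracted Answer"] = df["Response"].apply(extract_choice)
df.to_excel("results/gpt4o_foundation_f1_extracted.xlsx", index=False)
```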
| Models | F1 | F2 | I1 | I2 | I3 | I4 | I5 | I6 | FI1 | FI2 | FI3 | FI4 | FI5 | FI6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT 4o | 50.00 | 58.00 | 46.66 | 73.33 | 20.00 | 20.00 | 86.66 | 75.00 | 71.43 | 53.33 | 78.57 | 53.33 | 33.33 | 41.67 |
| LLAMA 3.3 70B Int. | 59.00 | 56.00 | 33.33 | 60.00 | 40.00 | 40.00 | 73.33 | 75.00 | 64.29 | 33.33 | 71.43 | 53.33 | 6.67 | 20.83 |
| LLAMA 3.1 405B Int. | 53.00 | 59.00 | 40.00 | 53.33 | 20.00 | 40.00 | 86.66 | 56.25 | 64.29 | 46.67 | 71.43 | 13.33 | 26.67 | 41.67 |
| MISTRAL Large | 41.00 | 56.00 | 41.66 | 53.33 | 31.25 | 20.00 | 73.33 | 60.00 | 42.86 | 41.67 | 57.14 | 46.67 | 13.33 | 29.17 |
| Claude 3.5 Sonnet | 60.00 | 60.00 | 33.33 | 60.00 | 20.00 | 46.66 | 93.33 | 75.00 | 78.57 | 46.67 | 64.29 | 53.33 | 20.00 | 62.50 |
| Microsoft Phi 4 | 56.00 | 62.00 | 46.66 | 46.66 | 33.33 | 33.33 | 66.66 | 68.75 | 64.29 | 53.33 | 57.14 | 26.67 | 6.67 | 41.67 |

Legend: F1 and F2 are Foundation papers, I1 to I6 are Intermediate papers, and FI1 to FI6 are Final papers. All values are accuracy (%).
If you use this work, please cite it as:

```bibtex
@misc{gupta2025,
  title={Large Language Models Acing Chartered Accountancy},
  author={Jatin Gupta and Akhil Sharma and Saransh Singhania and Mohammad Adnan and Sakshi Deo and Ali Imam Abidi and Keshav Gupta},
  year={2025},
  eprint={2506.21031},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.21031},
}
```