
🤖 CA-Ben

This repository contains the modular Python application used to systematically evaluate Large Language Models (LLMs) against the CA-Ben (Chartered Accountancy Benchmark). It automates the entire workflow, from sending prompts to the models to processing their responses and extracting the final answers for analysis. 📊


Paper: Large Language Models Acing Chartered Accountancy.
Colab: Benchmark Loading and Analysis Script.
Benchmark: 📩 For access to the benchmark data, kindly email the corresponding author and cc the first author for reference. 📩



✨ Features

  • 🧱 Modular Architecture: Clean separation of concerns. Each module handles a specific task (API calls, file I/O, retry logic), making the code easy to maintain and extend.
  • 🔄 Robust Retry Logic: Automatically handles API rate limits and transient network errors using an exponential backoff strategy (see the sketch after this list). Never lose a request!
  • 🔑 Dynamic Token Management: Seamlessly refreshes API tokens upon encountering rate limit errors, ensuring uninterrupted long-running evaluation sessions.
  • 📂 Batch File Processing: Efficiently processes hundreds of question files organized in subdirectories, perfect for large-scale benchmarks.
  • 📝 Structured Excel Output: Saves all questions, model responses, and extracted answers into neatly organized .xlsx files for easy analysis and verification.
  • ⚙️ Centralized Configuration: Easily manage API keys, model parameters, and file paths from a single config.py file.
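
The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual retry_handler.py; the exception classes and the refresh hook are hypothetical names used only to show the backoff-plus-token-refresh shape.

    import random
    import time

    class RateLimitError(Exception):
        """Hypothetical: what an API client might raise on HTTP 429 (rate limit) responses."""

    class TransientNetworkError(Exception):
        """Hypothetical: what an API client might raise on timeouts and similar recoverable errors."""

    def with_retries(call_api, refresh_token=None, max_attempts=5, base_delay=2.0):
        """Retry call_api with exponential backoff; refresh the token on rate-limit errors."""
        for attempt in range(1, max_attempts + 1):
            try:
                return call_api()
            except RateLimitError:
                if refresh_token:
                    refresh_token()   # dynamic token management: swap in a fresh token, then retry
            except TransientNetworkError:
                pass                  # transient error: just back off and try again
            if attempt == max_attempts:
                raise RuntimeError(f"API call failed after {max_attempts} attempts")
            # exponential backoff with jitter: ~2 s, 4 s, 8 s, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))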

🚀 How to Run

Get the evaluation pipeline up and running in three simple steps:

  1. Install Dependencies 📦

    pip install -r requirements.txt
  2. Configure Your API Token 🔑 Open src/config.py and update the TOKEN variable with your API key (an illustrative config sketch follows these steps).

    TOKEN = "your_super_secret_api_token_here"
  3. Launch the Application ▶️ Execute the main script from the root directory.

    python src/main.py

    The application will begin processing the files and you will see the progress in your console.
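
For reference, a config.py for this kind of pipeline typically holds a handful of module-level constants. Only TOKEN is confirmed by the steps above; every other name below (MODEL_NAME, BASE_FOLDER, and so on) is an illustrative placeholder, not necessarily what the repository uses.

    # src/config.py -- illustrative sketch; names other than TOKEN are hypothetical
    TOKEN = "your_super_secret_api_token_here"    # API key used by api_client.py

    MODEL_NAME = "gpt-4o"            # which LLM endpoint to evaluate
    TEMPERATURE = 0.0                # deterministic decoding is typical for benchmarking
    MAX_RETRIES = 5                  # upper bound used by retry_handler.py

    BASE_FOLDER = "data/questions"   # root folder scanned for question files
    OUTPUT_FOLDER = "results"        # where the .xlsx result files are written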


📁 Project Structure & Module Descriptions

The application is organized for clarity and maintainability; a minimal wiring sketch follows the module tree below.

src/
├── 📜 main.py              # Main entry point - Kicks off the evaluation process.
├── ⚙️ config.py            # Central hub for all settings (API keys, model params, paths).
├── 🧠 prompts.py           # Stores the system prompts used to guide the AI models.
├── 🌐 api_client.py        # Handles all communication with the AI model APIs.
├── 📄 file_handler.py      # Manages all file operations (reading questions, writing results).
├── ⏳ retry_handler.py     # Implements the smart retry logic for failed API calls.
├── 🛠️ processor.py         # The core orchestrator that manages the entire workflow.
└── 📋 requirements.txt     # A list of all the Python libraries you need.
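
To make the module boundaries concrete, the sketch below shows one way an entry point like main.py could wire these pieces together. Every function and attribute name imported or referenced here is hypothetical and used only for illustration; the repository's actual interfaces may differ.

    # Hypothetical wiring of the modules above -- names are illustrative only.
    import config
    import prompts
    from api_client import send_request                          # hypothetical: one LLM call
    from file_handler import find_question_files, save_results   # hypothetical I/O helpers
    from retry_handler import with_retries                       # hypothetical backoff wrapper

    def main():
        rows = []
        for path in find_question_files(config.BASE_FOLDER):
            question = path.read_text(encoding="utf-8")
            response = with_retries(lambda: send_request(prompts.SYSTEM_PROMPT, question))
            rows.append({"file": str(path), "question": question, "response": response})
        save_results(rows, config.OUTPUT_FOLDER)

    if __name__ == "__main__":
        main()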

💡 Core Logic

The application's workflow is orchestrated by processor.py and follows these steps:

  1. Initialization: Loads configuration from config.py and system prompts from prompts.py.
  2. File Discovery: The file_handler.py scans the specified base folder for all question files.
  3. Processing Loop: For each question:
    • The question content is combined with a system prompt.
    • api_client.py sends the request to the target LLM.
    • If the API call fails, retry_handler.py takes over, waiting and retrying with exponential backoff.
  4. Response Handling: The model's raw text response is received.
  5. Data Storage: The question, the model's full response, and other metadata are saved to an Excel file by file_handler.py.
  6. Answer Extraction: A separate pass (described in our paper) parses the Excel files to extract each final answer choice for accuracy calculation; a rough sketch follows this list.
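
The exact extraction procedure is described in the paper; purely as an illustration, one common approach is a regular-expression pass over the saved responses, along the lines of the sketch below (the pattern and the assumption of A–D options are hypothetical).

    import re

    # Hypothetical example: pull the final multiple-choice letter out of a model response.
    ANSWER_PATTERN = re.compile(r"(?:final answer|answer)\s*[:\-]?\s*\(?([A-D])\)?", re.IGNORECASE)

    def extract_choice(response_text):
        """Return the last A-D choice the model states, or None if nothing matches."""
        matches = ANSWER_PATTERN.findall(response_text)
        return matches[-1].upper() if matches else None

    # In practice this would run over the "response" column of each saved .xlsx file.
    print(extract_choice("Comparing the options ... Final answer: (C)"))   # -> C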

📩 For access to the benchmark data, kindly email the corresponding author and cc the first author for reference. 📩


Table: Models' Performance (%) on Foundation, Intermediate, and Final-level Subjects

Foundation subjects are F1–F2, Intermediate subjects are I1–I6, and Final subjects are FI1–FI6 (see the legend below).

| Model | F1 | F2 | I1 | I2 | I3 | I4 | I5 | I6 | FI1 | FI2 | FI3 | FI4 | FI5 | FI6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT 4o | 50.00 | 58.00 | 46.66 | 73.33 | 20.00 | 20.00 | 86.66 | 75.00 | 71.43 | 53.33 | 78.57 | 53.33 | 33.33 | 41.67 |
| LLAMA 3.3 70B Int. | 59.00 | 56.00 | 33.33 | 60.00 | 40.00 | 40.00 | 73.33 | 75.00 | 64.29 | 33.33 | 71.43 | 53.33 | 6.67 | 20.83 |
| LLAMA 3.1 405B Int. | 53.00 | 59.00 | 40.00 | 53.33 | 20.00 | 40.00 | 86.66 | 56.25 | 64.29 | 46.67 | 71.43 | 13.33 | 26.67 | 41.67 |
| MISTRAL Large | 41.00 | 56.00 | 41.66 | 53.33 | 31.25 | 20.00 | 73.33 | 60.00 | 42.86 | 41.67 | 57.14 | 46.67 | 13.33 | 29.17 |
| Claude 3.5 Sonnet | 60.00 | 60.00 | 33.33 | 60.00 | 20.00 | 46.66 | 93.33 | 75.00 | 78.57 | 46.67 | 64.29 | 53.33 | 20.00 | 62.50 |
| Microsoft Phi 4 | 56.00 | 62.00 | 46.66 | 46.66 | 33.33 | 33.33 | 66.66 | 68.75 | 64.29 | 53.33 | 57.14 | 26.67 | 6.67 | 41.67 |

Legend:

  • F1: Business Math & Stats
  • F2: Business Econ & BCK
  • I1: Adv. Accounting
  • I2: Corp. Laws
  • I3: Taxation
  • I4: Cost & Mgmt. Acct.
  • I5: Auditing & Ethics
  • I6: Fin. & Strat. Mgmt.
  • FI1: Fin. Reporting
  • FI2: Adv. Fin. Mgmt.
  • FI3: Adv. Auditing
  • FI4: Direct Tax Laws
  • FI5: Indirect Tax Laws
  • FI6: Integrated Business Sol.

📚 Citation

If you use this work, please cite it as:

@misc{gupta2025,
  title={Large Language Models Acing Chartered Accountancy}, 
  author={Jatin Gupta and Akhil Sharma and Saransh Singhania and Mohammad Adnan and Sakshi Deo and Ali Imam Abidi and Keshav Gupta},
  year={2025},
  eprint={2506.21031},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.21031}, 
}
