NitiBench

[Technical Report] | [🤗 Hugging Face Dataset]

This repository hosts the evaluation script for the proposed benchmark in the paper:
NitiBench: A Comprehensive Study of LLM Frameworks’ Capabilities for Thai Legal Question Answering

It contains two main scripts:

- Generating responses using the setup proposed in the paper.
- Evaluating responses in both retrieval and end-to-end aspects.

📌 Getting Started

1️⃣ Clone the Repository

Clone this repository to your local machine:

git clone [REPO_URL]
cd NitiBench

2️⃣ Configure API Keys

Edit the environment settings file (setting.env) to store all your API keys.
An example configuration is provided in setting.env.example.
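
Before building the image, it can help to confirm that every key from the example file has been filled in. The following is a minimal sketch, not part of the repository: it assumes that setting.env uses plain KEY=VALUE lines with optional # comments, and the helper names parse_env and missing_keys are hypothetical.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (assumed format)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(example_text: str, actual_text: str) -> list:
    """Return keys named in the example file that are absent or empty in the actual file."""
    expected = parse_env(example_text)
    actual = parse_env(actual_text)
    return sorted(key for key in expected if not actual.get(key))


if __name__ == "__main__":
    example = Path("setting.env.example")
    actual = Path("setting.env")
    if example.exists() and actual.exists():
        print("keys still empty:", missing_keys(example.read_text(), actual.read_text()))
```

Run it from the repository root; an empty list means every key named in setting.env.example has a value in setting.env.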

3️⃣ Build and Run the Docker Container

Use the following command to build the Docker image and create a container:

docker build -t nitibench . && \
docker run -dit --rm --network=host --gpus all --shm-size=10gb --name nitibench-container nitibench bash

During the image build, the setup_data.py script runs automatically: it pulls the data from Hugging Face, preprocesses it, and stores it in /app/test_data.

4️⃣ Expected File Structure

Once inside the container, the file structure should look like this:

app/
|---LRG/
|   |---[packages]
|---test_data/
|   |---hf_tax.csv
|   |---hf_wcx.csv
|   |---lclm_sample.csv
|   |---hf_tax_reduced_section.csv
|   |---hf_wcx_reduced_section.csv
|---llama_index/

hf_tax.csv & hf_wcx.csv → Tax Case and WCX-CCL datasets.
hf_tax_reduced_section.csv & hf_wcx_reduced_section.csv → Reduced versions that keep only the queries whose referenced sections fall within the naive chunking strategy.
lclm_sample.csv → A 20% stratified sample of the WCX-CCL dataset.
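
A quick way to confirm that setup_data.py produced everything listed above is to check for the files directly. This sketch is not part of the repository; it assumes only the file names shown in the tree and the /app/test_data location, and missing_files is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the expected file structure.
EXPECTED_FILES = [
    "hf_tax.csv",
    "hf_wcx.csv",
    "lclm_sample.csv",
    "hf_tax_reduced_section.csv",
    "hf_wcx_reduced_section.csv",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("/app/test_data")
    print("all files present" if not missing else f"missing: {missing}")
```

If any file is reported missing, rebuilding the image so that setup_data.py runs again is the likely fix.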

🚀 Using the Benchmark

1️⃣ Generating Responses

To generate responses, use the configuration files inside:
📂 /app/LRG/config/all_e2e_config/

Run the following command:

python script/response_e2e.py --config_path=[PATH_TO_YOUR_CONFIG]

You can adjust the config file to match your preferences.
The generated responses will be saved as:
- tax_response.json
- wcx_response.json
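
When several configurations need to be run in sequence, the command above can be scripted. The sketch below is not part of the repository. It assumes that every file directly inside the config directory is a valid config and that it is launched from the directory containing script/; build_commands is a hypothetical helper.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(config_dir, script="script/response_e2e.py"):
    """Build one command per file found directly in config_dir, in sorted order."""
    configs = sorted(p for p in Path(config_dir).iterdir() if p.is_file())
    return [[sys.executable, script, f"--config_path={p}"] for p in configs]


if __name__ == "__main__":
    config_dir = Path("/app/LRG/config/all_e2e_config")
    if config_dir.is_dir():
        for cmd in build_commands(config_dir):
            print("running:", " ".join(cmd))
            subprocess.run(cmd, check=True)  # stop at the first failing run
    else:
        print(f"config directory not found: {config_dir}")
```

Because the output file names are fixed, check whether each config writes to its own location before running a batch; otherwise later runs may overwrite earlier results.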

2️⃣ Evaluating Responses

To evaluate the responses, create a config file inside:
📂 /app/LRG/config/all_e2e_metric_config/

Run the evaluation script:

python script/metric_e2e.py --config_path=[PATH_TO_YOUR_CONFIG]

The evaluation results will be saved in:

Per-query metrics:
- tax_e2e_metrics.json
- wcx_e2e_metrics.json
Global metrics:
- tax_global_metrics.json
- wcx_global_metrics.json
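
To inspect the global metrics after a run, the files can be loaded and printed. This is a minimal sketch that makes no assumption about the metric names inside each file; it assumes only that the files are valid JSON. The output directory is a placeholder, and load_global_metrics is a hypothetical helper.

```python
import json
from pathlib import Path

GLOBAL_METRIC_FILES = ["tax_global_metrics.json", "wcx_global_metrics.json"]


def load_global_metrics(output_dir, names=GLOBAL_METRIC_FILES):
    """Load each global-metrics file that exists in output_dir, keyed by file name."""
    results = {}
    for name in names:
        path = Path(output_dir) / name
        if path.is_file():
            with path.open(encoding="utf-8") as f:
                results[name] = json.load(f)
    return results


if __name__ == "__main__":
    # "." is a placeholder; point it at wherever the evaluation script wrote its results.
    for name, metrics in load_global_metrics(".").items():
        print(name)
        print(json.dumps(metrics, indent=2, ensure_ascii=False))
```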

Models

Acknowledgement

We would like to express our sincere gratitude to Supavich Punchun for facilitating WCX-CCL data preparation, and to Apiwat Sukthawornpradit, Watcharit Boonying, and Tawan Tantakull for scraping, preprocessing, and preparing the Tax Case Dataset. We also thank all VISAI.AI company members for assisting in quality control for LLM-as-a-judge metric validation.

We are deeply thankful to the legal expert annotators for their meticulous work in annotating samples, which was essential for validating the LLM-as-a-judge metrics.

Special thanks to Prof. Keerakiat Pratai (Faculty of Law, Thammasat University) for insightful consultations on Thai legal information and background knowledge, which significantly enriched our research.

We sincerely thank PTT, SCB, and SCBX, the main sponsors of the WangchanX project, for their generous support. Their contributions have been instrumental in advancing research on Thai legal AI.

Next, we extend our appreciation to the research assistants at VISTEC for their valuable guidance in constructing benchmarks for LLM systems, particularly in retrieval and end-to-end (E2E) metrics.

Lastly, if you use our code in your research, please cite our work:

@misc{akarajaradwong2025nitibenchcomprehensivestudiesllm,
      title={NitiBench: A Comprehensive Studies of LLM Frameworks Capabilities for Thai Legal Question Answering}, 
      author={Pawitsapak Akarajaradwong and Pirat Pothavorn and Chompakorn Chaksangchaichot and Panuthep Tasawong and Thitiwat Nopparatbundit and Sarana Nutanong},
      year={2025},
      eprint={2502.10868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10868}, 
}

Contribution

We welcome contributions from the community! Whether it's bug fixes, feature additions, or documentation improvements, your input is valuable.

How to Contribute

Fork the repository
Create your feature branch
```
git checkout -b feature/NewFeature
```
Commit your changes
```
git commit -m 'Add some NewFeature'
```
Push to the branch
```
git push origin feature/NewFeature
```
Open a Pull Request

We look forward to your contributions! 🚀
