NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction (Link to the paper)
NumPert investigates the numerical reasoning and classification capabilities of large language models on claim-and-evidence text inputs. The primary goal is to develop more robust approaches for numerical reasoning in language models.
..\code\data_preprocessing\raw_dataset_preprocessing for cleaning the dataset, e.g. removing fact-checkers' verdicts from the evidence, creating the binary dataset, and tagging numerical values with spaCy
..\code\llm_eval\ scripts to evaluate LLMs using different wrappers connecting to APIs.
..\code\metrics\ scripts to calculate evaluation scores
..\code\error_analysis\ scripts used to extract reasoning tokens from reasoning models for manual analysis
Contains raw and perturbed data
Baseline results (unperturbed claims) are kept in a separate folder for the evaluated models.
Each model has its own subdirectory for the perturbed claims. These subdirectories are split into zero-shot and two-shot evaluation, and, for some select models, the perturbation-aware prompt (PAP) setting (called neg_shot in the folder structure).
Before perturbing the data and evaluating the models, we perform some data preprocessing. The files are found in ..\code\data_preprocessing\perturbutations\
- create_binary_dataset.py removes the third (Conflicting) class, so the dataset only contains True and False labels.
- remove_reference.py removes the last part of the evidence document, so the verdict is not stated in it; this forces the model to infer the correct values instead of relying on the fact-checker's verdict.
- process_claims.py normalizes the data from step 2: it converts words to numbers in claims and runs named entity recognition on claims to extract tokens with numerical values.
- main_perturb.py runs all perturbation types on the data from step 3.
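The word-to-number normalization in step 3 can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: the helper `words_to_number` and its small vocabulary are assumptions, and the real process_claims.py additionally relies on spaCy's named entity recognition for tagging.

```python
import re

# Minimal vocabulary; the real script presumably covers far more cases.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SCALES = {"hundred": 100, "thousand": 1_000, "million": 1_000_000,
          "billion": 1_000_000_000}

def words_to_number(text: str) -> str:
    """Replace spelled-out '<unit> <scale>' phrases (e.g. 'five million') with digits."""
    pattern = re.compile(
        r"\b({units})\s+({scales})\b".format(
            units="|".join(UNITS), scales="|".join(SCALES)),
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: str(UNITS[m.group(1).lower()] * SCALES[m.group(2).lower()]),
        text,
    )
```

For example, `words_to_number("The city spent five million dollars")` yields the claim with `5000000` in digit form, which makes the numerical token easy to locate and perturb in later steps.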
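To give a flavor of what a numerical perturbation in step 4 might look like, here is one hypothetical example. The source does not describe the individual perturbation types, so `scale_numbers` and its behavior are assumptions for illustration only.

```python
import re

def scale_numbers(claim: str, factor: int = 10) -> str:
    """Hypothetical perturbation: multiply every integer in the claim by `factor`,
    producing a claim whose veracity label should flip or change."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) * factor), claim)
```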
Use the files in the ..\code\llm_eval\ directory. We use different files for different model types (OpenAI, Google, or open-weight models via the Ollama wrapper). Use argparse to configure model type, input/output paths, API configuration, and other miscellaneous settings.
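A minimal sketch of how such an argparse-based CLI might be wired. The flag names and choices below are assumptions for illustration, not the repository's actual options:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for one evaluation run (illustrative flags only)."""
    parser = argparse.ArgumentParser(
        description="Evaluate an LLM on (perturbed) claims.")
    parser.add_argument("--model-type", required=True,
                        choices=["openai", "google", "ollama"],
                        help="Which API wrapper to use.")
    parser.add_argument("--model-name", required=True,
                        help="Model identifier passed to the wrapper.")
    parser.add_argument("--input-path", required=True,
                        help="Path to the claims file.")
    parser.add_argument("--output-path", required=True,
                        help="Where to write model responses.")
    parser.add_argument("--shots", type=int, default=0,
                        help="Number of in-context examples (e.g. 0 or 2).")
    return parser
```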
The directory also includes json_to_jsonl.py, which sets a response format intended for OpenAI's batch evaluation.
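Conceptually, such a conversion wraps each claim into one request object per JSONL line, in the shape OpenAI's Batch API expects. This is a sketch under that assumption; the `custom_id` scheme, prompt content, and model name are illustrative, not taken from the repository:

```python
import json

def to_batch_jsonl(claims, model="gpt-4o-mini"):
    """Yield one JSONL line per claim in the OpenAI Batch API request shape."""
    for i, claim in enumerate(claims):
        request = {
            "custom_id": f"claim-{i}",        # unique id to match responses back
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": claim}],
            },
        }
        yield json.dumps(request)
```

Each yielded string is written as one line of the .jsonl file that is uploaded for batch processing.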
The following tables present accuracy for the different data splits. Red -x indicates a drop; green +x indicates an increase. Values in bold denote the highest accuracy within each perturbation setting, separated into open-weight and proprietary models. PAP denotes the perturbation-aware prompt setting.
@misc{aarnes2025numpertnumericalperturbationsprobe,
title={NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction},
author={Peter Røysland Aarnes and Vinay Setty},
year={2025},
eprint={2511.09971},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.09971},
}