WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

This repository contains benchmark tasks for evaluating how well Large Language Models (LLMs) understand the societal impacts of disruptive weather events from historical sources.

Paper (PDF): https://arxiv.org/abs/2505.20249


Data 🗂️

The data is obtained through collaboration with a proprietary archive institution and covers two temporal periods.

Multi-Label Classification

We provide the LongCTX dataset in datasets/LongCTX_Dataset(350).csv and the MixedCTX data in datasets/MixedCTX_Dataset(1386).csv.

These datasets contain labeled historical records of disruptive weather events and their societal impacts. Each entry includes temporal information, weather type, article text, and human-annotated binary labels for six distinct impact categories, serving as ground truth.

Example instance

ID,Date,Time_Period,Weather_Type,Article,Infrastructural Impact,Political Impact,Financial Impact,Ecological Impact,Agricultural Impact,Human Health Impact
0,18800116,historical,Storm, ... On the 22nd another storm arose, and the sea swept the decks, smashing the bulwarks from the bridge aft, destroying the steering gear and carrying overboard a seaman named Anderson. Next day the storm abated and the ship's course was shaped for this port...,1,0,0,0,0,1
  • "ID": Unique identifier for each entry.

  • "Date": Date of the weather event in YYYYMMDD format.

  • "Time_Period": Classification of the historical period.

  • "Weather_Type": Type of weather event.

  • "Article": Text content extracted from historical newspapers describing the event.

  • "Impact Columns": Six ground-truth binary labels indicating the impact of the weather event.

Question-Answering Ranking

We share the question-answering candidate pool in datasets/QACandidate_Pool.csv.

The candidate pool is constructed from the LongCTX dataset, with each article generating a query based on its impact labels.

Example instance

id,query,correct_passage_index,passage_1,passage_2, ...,passage_100
0, What specific infrastructure and agricultural impact did the British steamer Canopus experience..., 12, p1, p2, ..., p100
  • "ID": Unique identifier for each entry.

  • "Query": Pseudo question generated for question answering.

  • "Correct_passage_index": the index number of the ground-truth passage in the 100 passages candidate pool.

  • "Passage columns": Candidate pool [1 ground-truth + 99 noise columns]

How to use

Our code repository provides two tasks to evaluate large language models (LLMs) on understanding the impacts of disruptive weather.

Multi-Label Classification 🔍

The multi-label classification task assesses an LLM’s ability to identify disruptive weather impacts in a given article.
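
As an illustration of what the task asks of a model, the sketch below shows one possible prompt-and-parse loop; the actual prompt and API calls live in model_eval.py, and call_llm here is a hypothetical stand-in for whichever client backs your chosen model_name.

```python
IMPACT_COLUMNS = [
    "Infrastructural Impact", "Political Impact", "Financial Impact",
    "Ecological Impact", "Agricultural Impact", "Human Health Impact",
]

def classify_article(article: str, call_llm) -> list[int]:
    """Ask the model for six 0/1 flags, one per impact category."""
    prompt = (
        "Read the historical weather report below and answer with six "
        "comma-separated 0/1 flags, in this order: "
        + ", ".join(IMPACT_COLUMNS) + ".\n\nArticle: " + article
    )
    reply = call_llm(prompt)  # hypothetical helper wrapping your LLM client
    return [int(x.strip()) for x in reply.split(",")[:6]]
```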

To run this task:

  1. Navigate to the task directory: cd ./Multi-label_Task.
  2. Install the required packages: pip install -r requirements.txt.
  3. Configure model settings in model_eval.py:
    • set model_name to your desired model.
  4. Set your-input.csv to the dataset you want to evaluate (LongCTX or MixedCTX; see Data) and your-output.csv to the path where results should be saved.
  5. Run the evaluation script: python model_eval.py.
  6. Compute the metrics with python metrics.py (a sketch of the computation follows below).
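
The reported classification metrics are F1 per impact category and row-wise accuracy; metrics.py is the authoritative implementation, but the sketch below shows the intended computation, assuming predictions and ground truth are aligned 0/1 arrays of shape (n_articles, 6).

```python
import numpy as np
from sklearn.metrics import f1_score

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray, categories: list[str]) -> dict:
    # Per-category F1 over the binary labels.
    scores = {cat: f1_score(y_true[:, i], y_pred[:, i]) for i, cat in enumerate(categories)}
    # Row-wise accuracy: an article counts as correct only if all six labels match.
    scores["row_wise_accuracy"] = float((y_true == y_pred).all(axis=1).mean())
    return scores
```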

Question-Answering Ranking 🥇

The question-answering ranking task evaluates an LLM’s ability to determine the likelihood that a given article contains the correct answer based on its parametric knowledge.
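
Conceptually, the model scores or orders the 100 candidate passages for each query. The sketch below only illustrates that setup; score_passage is a hypothetical callable standing in for whatever prompting strategy model_eval.py actually uses.

```python
def rank_passages(query: str, passages: list[str], score_passage) -> list[int]:
    """Return 1-based passage indices ordered from most to least likely to answer the query.

    `score_passage(query, passage)` is a hypothetical callable that asks the LLM
    how likely the passage is to contain the answer and returns a float score.
    """
    scores = [score_passage(query, p) for p in passages]
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in order]
```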

To run this task:

  1. Navigate to the task directory: cd ./QA-ranking_Task.
  2. Install the required packages: pip install -r requirements.txt.
  3. Configure model settings in model_eval.py:
    • set model_name to your desired model.
  4. Run the evaluation script: python model_eval.py.
  5. In metrics.py, set your-output.json to the result file saved by model_eval.py.
  6. Run the metrics script: python metrics.py (a per-query sketch of the ranking metrics follows below).
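
For reference, the ranking metrics reduce to simple formulas when each query has exactly one relevant passage; metrics.py computes them over the full output file, but a per-query sketch looks like this (the cutoff k is illustrative).

```python
import math

def reciprocal_rank(ranking: list[int], gold: int) -> float:
    """1 / rank of the gold passage in the model's ordering (best first)."""
    return 1.0 / (ranking.index(gold) + 1)

def ndcg_at_k(ranking: list[int], gold: int, k: int = 10) -> float:
    """With one relevant passage, ideal DCG is 1, so NDCG@k is 1/log2(rank + 1) inside the top k."""
    if gold in ranking[:k]:
        return 1.0 / math.log2(ranking.index(gold) + 2)
    return 0.0

# MRR and NDCG over the dataset are the means of these per-query values.
```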

The evaluation reports F1 scores by impact category and row-wise accuracy for the classification task, and retrieval performance metrics (NDCG, MRR) for the ranking task.

For detailed performance analysis, please refer to our paper.

Citation

Please cite the following paper if you find our benchmark helpful!

@misc{yu2025wximpactbenchdisruptiveweatherimpact,
      title={WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models}, 
      author={Yongan Yu and Qingchen Hu and Xianda Du and Jiayin Wang and Fengran Mo and Renee Sieber},
      year={2025},
      eprint={2505.20249},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20249}, 
}

Miscellaneous

Please send any questions about the resources to yongan.yu@mail.mcgill.ca.
