2School of Informatics, Xiamen University, China
* corresponding author
Paper: https://aclanthology.org/2025.emnlp-main.105.pdf
##Evaluation Benchmark## We designed and released F²Bench, which covers 10 demographic group categories, including a range of intersectional combinations, to comprehensively evaluate the fairness of LLMs across diverse population groups.
##Open-ended Tasks## In F²Bench, we propose two open-ended tasks, based on text generation and reasoning, that take factuality into account. These tasks reflect real-world usage better than traditional closed-ended evaluation.
##Experimental Analysis## Using F²Bench, we evaluated several popular LLMs, compared their performance and analyzed the reasons behind it, discussed the differences between closed-ended and open-ended evaluation, and offered insights for future LLM training strategies.
tqdm
zhipuai
openai
transformers
pandas
itertools
torch
modelscope
openpyxl

Same as McBE.
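
Before running the benchmark, it can help to confirm that the packages listed above are importable in the current environment. The following is a minimal sketch and not part of the F²Bench code: the helper name `missing_packages` is hypothetical, and it assumes each listed name is also the module's import name. Note that `itertools` ships with the Python standard library and needs no separate installation.

```python
import importlib.util

# Names copied from the requirements list above (assumed to match import names).
# itertools is part of the Python standard library.
REQUIRED = [
    "tqdm", "zhipuai", "openai", "transformers", "pandas",
    "itertools", "torch", "modelscope", "openpyxl",
]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any name reported as missing can then be installed with your usual package manager. No versions are pinned here because the list above does not specify them.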
