TY - JOUR
T1 - Performance unfairness of large language models in cross-language fact-checking
AU - Wang, Dandan
AU - Tsang, Stephanie Jean
AU - Zhou, Yadong
N1 - This work was supported by the Initiation Grant for Faculty Niche Research Areas, Hong Kong Baptist University [Grant number RC-FNRA-IG/21-22/ARTS/01].
Publisher copyright: © 2026 The Author(s). Published by Elsevier Ltd.
PY - 2026/1/13
Y1 - 2026/1/13
N2 - Large language models (LLMs) are increasingly used for automated fact-checking, yet their performance often varies across languages, raising global fairness concerns. This study evaluated cross-language inequality in LLM-based fact-checking using 4,500 claims spanning nine languages across six language families. We built a systematic performance-evaluation pipeline covering instruction following, authenticity classification, evidence generation, and checking-worthiness scoring, and quantified inequality using the standard deviation, coefficient of variation, Gini coefficient, and Theil index. Results showed substantial cross-language disparities, with higher performance on claims from high-resource languages. To mitigate this inequality, we tested two interventions: role-restricted prompt engineering and model fine-tuning. Both approaches reduced disparities, with fine-tuning achieving the largest and most consistent improvement across languages, particularly in checking-worthiness scoring. This study provides a reproducible framework for quantifying multilingual performance and fairness in LLM-based fact-checking and offers practical guidance for developing more equitable verification systems across diverse linguistic contexts.
AB - Large language models (LLMs) are increasingly used for automated fact-checking, yet their performance often varies across languages, raising global fairness concerns. This study evaluated cross-language inequality in LLM-based fact-checking using 4,500 claims spanning nine languages across six language families. We built a systematic performance-evaluation pipeline covering instruction following, authenticity classification, evidence generation, and checking-worthiness scoring, and quantified inequality using the standard deviation, coefficient of variation, Gini coefficient, and Theil index. Results showed substantial cross-language disparities, with higher performance on claims from high-resource languages. To mitigate this inequality, we tested two interventions: role-restricted prompt engineering and model fine-tuning. Both approaches reduced disparities, with fine-tuning achieving the largest and most consistent improvement across languages, particularly in checking-worthiness scoring. This study provides a reproducible framework for quantifying multilingual performance and fairness in LLM-based fact-checking and offers practical guidance for developing more equitable verification systems across diverse linguistic contexts.
KW - Fact-checking
KW - Large language models
KW - Cross-language unfairness
KW - Prompt engineering
KW - Model fine-tuning
U2 - 10.1016/j.ipm.2026.104616
DO - 10.1016/j.ipm.2026.104616
M3 - Journal article
SN - 0306-4573
VL - 63
JO - Information Processing and Management
JF - Information Processing and Management
IS - 4
M1 - 104616
ER -