Performance unfairness of large language models in cross-language fact-checking

Research output: Contribution to journal › Journal article › peer-review

Abstract

Large language models (LLMs) are increasingly used for automated fact-checking, yet their performance often varies across languages, raising global fairness concerns. This study evaluated cross-language inequality in LLM-based fact-checking using 4,500 claims spanning nine languages across six language families. In addition to building a systematic performance-evaluation pipeline covering instruction following, authenticity classification, evidence generation, and check-worthiness scoring, we quantified inequality using the standard deviation, the coefficient of variation, the Gini coefficient, and the Theil index. Results showed substantial cross-language disparities, with higher performance on claims from high-resource languages. To mitigate this inequality, we tested two interventions: role-restricted prompt engineering and model fine-tuning. Both approaches reduced disparities, with fine-tuning achieving the largest and most consistent improvements across languages, particularly in check-worthiness scoring. This study provides a reproducible framework for quantifying multilingual performance and fairness in LLM-based fact-checking and offers practical guidance for developing more equitable verification systems across diverse linguistic contexts.
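
The four inequality measures named in the abstract have standard closed forms. As an illustration only (this is not the paper's code, and the per-language accuracies below are invented placeholders, not the study's results), a minimal Python sketch computing all four over a vector of per-language performance scores might look like this:

```python
import numpy as np

def inequality_metrics(scores):
    """Compute the four dispersion statistics over per-language scores.

    scores: 1-D sequence of positive performance values (e.g. accuracy
    per language). Returns standard deviation, coefficient of variation,
    Gini coefficient, and Theil index.
    """
    x = np.asarray(scores, dtype=float)
    n = x.size
    mean = x.mean()

    std = x.std(ddof=0)   # population standard deviation
    cv = std / mean       # coefficient of variation

    # Gini coefficient via the sorted-rank formula:
    # G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n, with x ascending
    xs = np.sort(x)
    ranks = np.arange(1, n + 1)
    gini = 2.0 * np.sum(ranks * xs) / (n * xs.sum()) - (n + 1) / n

    # Theil index (Theil's T): mean of (x/mu) * ln(x/mu)
    ratio = x / mean
    theil = np.mean(ratio * np.log(ratio))

    return {"std": std, "cv": cv, "gini": gini, "theil": theil}

# Hypothetical accuracies for nine languages (placeholder values)
acc = [0.86, 0.84, 0.80, 0.78, 0.74, 0.71, 0.66, 0.61, 0.55]
print(inequality_metrics(acc))
```

All four statistics are zero when every language scores identically and grow with disparity, so smaller values after an intervention indicate reduced cross-language unfairness.
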
Original language: English
Article number: 104616
Number of pages: 32
Journal: Information Processing and Management
Volume: 63
Issue number: 4
Early online date: 13 Jan 2026
Publication status: E-pub ahead of print - 13 Jan 2026

User-Defined Keywords

  • Fact-checking
  • Large language models
  • Cross-language unfairness
  • Prompt engineering
  • Model fine-tuning
