Development and Performance of a Large Language Model for the Quality Evaluation of Multi-Language Medical Imaging Guidelines and Consensus

Zhixiang Wang, Jing Sun, Hui Liu, Xufei Luo, Jia Li, Wenjun He, Zhenhua Yang, Han Lv*, Yaolong Chen*, Zhenchang Wang*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Aim: This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, with a focus on improving evaluation efficiency and consistency while reducing manual workload.

Method: We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included consistency with expert assessments, hybrid search paragraph matching accuracy, information completeness, comparisons of different paragraph matching approaches, and cost-time efficiency. 
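
The abstract does not specify how Hybrid Search Enhancement combines its retrieval signals, so the following is only a minimal sketch of the general idea of hybrid paragraph matching: a keyword-overlap score blended with a similarity score. Everything here is an assumption made for illustration, including the function names (`keyword_score`, `hybrid_rank`), the weight `alpha`, and the use of character-trigram cosine similarity as a toy stand-in for a learned dense embedding. It is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (English-only toy tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, paragraph):
    """Fraction of distinct query terms that also appear in the paragraph."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(paragraph))) / len(q_terms)


def char_trigrams(text):
    """Character-trigram counts: a hypothetical stand-in for a dense embedding."""
    s = " ".join(tokenize(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_rank(query, paragraphs, alpha=0.5):
    """Rank paragraphs by alpha * keyword score + (1 - alpha) * similarity score.

    Returns (paragraph_index, score) pairs, best match first.
    """
    q_vec = char_trigrams(query)
    scored = []
    for idx, para in enumerate(paragraphs):
        score = alpha * keyword_score(query, para) + (1 - alpha) * cosine(q_vec, char_trigrams(para))
        scored.append((score, idx))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(idx, round(score, 3)) for score, idx in scored]


if __name__ == "__main__":
    guideline_paragraphs = [
        "Funding sources and conflicts of interest are declared.",
        "MRI protocol for the knee.",
        "The guideline development group included radiologists.",
    ]
    for idx, score in hybrid_rank("conflicts of interest declared", guideline_paragraphs):
        print(idx, score)
```

A real multi-language system would need a tokenizer and an embedding model that handle Chinese as well as English. The regular expression above only matches ASCII letters and digits, so it is suitable for illustration only.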

Results: The algorithm achieved an average accuracy of 77%, performing well on simpler tasks but showing lower accuracy (29%–40%) on complex evaluations, such as explanations and visual aids. The average accuracy rates for the English and Chinese GACS were 74% and 76%, respectively (p = 0.37). Hybrid search performed best, with scores of 4.42 for paragraph matching accuracy and 4.42 for information completeness, significantly outperforming keyword-based search (1.05 and 1.05) and sparse-dense retrieval (4.26 and 3.63). The algorithm reduced evaluation time to 8 min 30 s per guideline and cost to approximately 0.5 USD per guideline, a considerable advantage over traditional manual methods.
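
As a rough illustration of what the reported per-guideline figures imply for a batch, the sketch below scales them linearly. It assumes guidelines are processed one after another and that time and cost per guideline stay constant, neither of which the abstract states. The function name `batch_estimate` is hypothetical.

```python
SECONDS_PER_GUIDELINE = 8 * 60 + 30  # 8 min 30 s per guideline, as reported
USD_PER_GUIDELINE = 0.5              # approximate cost per guideline, as reported


def batch_estimate(n_guidelines):
    """Estimate total time and cost, assuming sequential runs and linear scaling."""
    total_seconds = n_guidelines * SECONDS_PER_GUIDELINE
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return {
        "hours": hours,
        "minutes": minutes,
        "seconds": seconds,
        "usd": n_guidelines * USD_PER_GUIDELINE,
    }


if __name__ == "__main__":
    # 45 is the size of the validation set described in the Method paragraph.
    print(batch_estimate(45))
```

Under these assumptions, the 45 validation guidelines would take about 6 h 22 min 30 s and cost roughly 22.5 USD in total.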

Conclusion: The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.

Original language: English
Article number: e70020
Journal: Journal of Evidence-Based Medicine
Volume: 18
Issue number: 2
Publication status: Published - Jun 2025

User-Defined Keywords

  • consensus
  • guideline transparency
  • guideline
  • large language model
  • medical imaging
  • quality assessment
