
Enhancing LLM-based medical decision-making by test-time knowledge acquisition

  • Shipeng Li
  • Liuxin Bao
  • Shikun Li*
  • Bo Wan*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

1 Citation (Scopus)

Abstract

Purpose: Medical decision-making (MDM) is a complex clinical reasoning process that requires the systematic integration of multidisciplinary knowledge and evidence. Current approaches based on large language models (LLMs) are constrained by their reliance on static training corpora and often exhibit limited domain-specific adaptation, which can compromise diagnostic accuracy and reliability. This study aims to overcome these limitations by developing a framework that enables LLMs to dynamically acquire and refine knowledge during test time, thereby enhancing the robustness and precision of MDM systems.

Methods: We propose a test-time optimization framework that refines a frozen LLM’s diagnostic reasoning through test-time knowledge acquisition and integration. For each medical query, the model generates multiple trajectories that are synthesized into a pseudo reference answer, whose self-consistency score separates confident from unconfident cases. Confident cases enable reward-guided reflection to extract reliable diagnostic heuristics, while unconfident cases undergo unsupervised reflection to reveal reasoning gaps and uncertainty patterns. The extracted knowledge is continually incorporated into an evolving, capacity-controlled knowledge base through operations that add, modify, or merge knowledge. This updated knowledge base then guides subsequent inference, allowing the model to adapt its reasoning strategy during test time without updating any parameters.
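The routing and knowledge-base steps described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how trajectories are synthesized or how capacity is enforced, so majority voting, the 0.6 threshold, oldest-first eviction, and all names (`pseudo_reference`, `route`, `KnowledgeBase`) are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field


def pseudo_reference(answers):
    """Majority-vote sampled answers (assumed synthesis rule).

    Returns the winning answer and its vote share, used here as the
    self-consistency score.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def route(answers, threshold=0.6):
    """Separate confident from unconfident cases (threshold is hypothetical)."""
    answer, score = pseudo_reference(answers)
    mode = "reward_guided" if score >= threshold else "unsupervised"
    return answer, score, mode


@dataclass
class KnowledgeBase:
    """Capacity-controlled store of textual heuristics (illustrative)."""
    capacity: int = 3
    entries: list = field(default_factory=list)

    def add(self, text):
        # Skip exact duplicates, then evict oldest entries past capacity
        # (oldest-first eviction is an assumption, not from the paper).
        if text not in self.entries:
            self.entries.append(text)
        while len(self.entries) > self.capacity:
            self.entries.pop(0)

    def modify(self, index, text):
        # Overwrite an existing entry in place.
        self.entries[index] = text

    def merge(self, i, j, merged_text):
        # Replace two entries with a single combined entry.
        for idx in sorted({i, j}, reverse=True):
            del self.entries[idx]
        self.add(merged_text)

    def as_prompt(self):
        # Render the knowledge base as text to prepend to the next query.
        return "\n".join(f"- {e}" for e in self.entries)
```

In such a setup, the text returned by `as_prompt()` would be prepended to each subsequent query, so the frozen model's behavior changes through its context rather than through any parameter update.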

Results: Experimental evaluations on three public medical decision-making benchmarks—MedQA, NEJMQA, and MMLU-Pro-Health—show that the proposed framework consistently improves the performance of the state-of-the-art LLM, DeepSeekv3.2 Exp 671B. For example, on the MMLU-Pro-Health dataset, our method achieved an average accuracy of 79.22%, surpassing DeepSeekv3.2 Exp 671B by 1.84 percentage points, thus demonstrating the effectiveness of the framework in enhancing diagnostic decision-making.

Conclusion: By leveraging inference-time self-evaluation and experience accumulation, this work introduces a new paradigm for building reliable, adaptive, and context-aware medical AI systems. It underscores the critical role of continual knowledge evolution in advancing trustworthy artificial intelligence for clinical decision support and lays the foundation for future developments in dynamic and responsive medical reasoning tools.
Original language: English
Article number: 51
Number of pages: 14
Journal: Health Information Science and Systems
Volume: 14
Issue number: 1
Early online date: 6 Apr 2026
Publication status: E-pub ahead of print - 6 Apr 2026

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

User-Defined Keywords

  • Medical decision-making
  • Large language model
  • Dynamic knowledge acquisition
  • Test-time compute
