TY - JOUR
T1 - Comparative analysis of the performance of the large language models DeepSeek-V3, DeepSeek-R1, OpenAI o3-mini and OpenAI o3-mini high in urology
AU - Yan, Zijun
AU - Fan, Ke Qin
AU - Zhang, Qi
AU - Wu, Xinyan
AU - Chen, Yuquan
AU - Wu, Xinyu
AU - Yu, Ting
AU - Su, Ning
AU - Zou, Yan
AU - Chi, Hao
AU - Xia, Liangjing
AU - Cao, Qiang
N1 - Open access funding provided by Hong Kong Baptist University Library. This work was supported by the National Natural Science Foundation of China (No. 42267063), the 2024 Healthcare Quality (Evidence-Based) Management Research Project of the National Institute of Hospital Administration, National Health Commission of the People’s Republic of China (YLZLXZ24G039), the Sichuan Provincial Administration of Traditional Chinese Medicine Research Project (2023MS057; 2023MS207), the Yunnan Provincial Department of Science and Technology Joint Project of Local Universities (202001BA070001-041), and the Key Project of Popular Science Research of the Chinese Pharmaceutical Association (CMEI2024KPYJ(JZYY)00427).
Publisher Copyright:
© The Author(s) 2025.
PY - 2025/7/7
Y1 - 2025/7/7
AB - Objectives: We compared how DeepSeek-V3, DeepSeek-R1, OpenAI o3-mini, and OpenAI o3-mini high handle urological questions, particularly in areas such as benign prostatic enlargement, urinary stones, infections, and guideline updates. The aim was to identify how these text-generation models might aid clinical practice without overlooking potential gaps in accuracy. Methods: A set of 34 routinely asked questions plus 25 queries based on newly revised guidelines was assembled. Six board-certified urologists independently scored each model's replies on a five-point scale. Questions scoring below a set threshold were reintroduced to the same model, accompanied by critiques, to gauge self-correction. Statistical analyses focused on total scores, the percentage of excellent ratings, and improvements after iterative prompting. Results: Across all 59 queries (34 general plus 25 guideline-based), OpenAI o3-mini high recorded the highest median total score (22 [20–24]), significantly outperforming DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini (all pairwise p < 0.01). DeepSeek-R1's accuracy approached that of o3-mini high on patient-counseling items, where their excellent-answer rates were 49% and 57%, respectively. DeepSeek-V3 achieved solid baseline correctness but made fewer successful corrections on subsequent attempts. Although OpenAI o3-mini initially produced more concise responses, it showed a surprisingly strong capacity to revise earlier errors. Conclusion: OpenAI o3-mini high, followed by DeepSeek-R1, provided the most reliable answers for contemporary urological concerns, whereas DeepSeek-V3 exhibited limited adaptability during re-evaluation. Despite its often briefer replies, OpenAI o3-mini outperformed DeepSeek-V3 in self-correction. These findings indicate that, when reviewed by a clinician, o3-mini high can serve as a rapid second-opinion tool for outpatient counseling and protocol updates, whereas DeepSeek-R1 may offer a cost-effective alternative in resource-limited settings.
KW - Clinical guidelines
KW - Large language models
KW - Performance evaluation
KW - Self-correction capacity
KW - Urology
UR - http://www.scopus.com/inward/record.url?scp=105009985789&partnerID=8YFLogxK
U2 - 10.1007/s00345-025-05757-4
DO - 10.1007/s00345-025-05757-4
M3 - Journal article
C2 - 40622427
AN - SCOPUS:105009985789
SN - 0724-4983
VL - 43
JO - World Journal of Urology
JF - World Journal of Urology
IS - 1
M1 - 416
ER -