TY - JOUR
T1 - A visually grounded language model for fetal ultrasound understanding
AU - Guo, Xiaoqing
AU - Alsharid, Mohammad
AU - Zhao, He
AU - Wang, Yipei
AU - Lander, Jayne
AU - Papageorghiou, Aris T.
AU - Noble, J. Alison
N1 - We acknowledge UKRI grant reference (EP/X040186/1), EPSRC grant (EP/T028572/1) and ERC grant (ERC-ADG-2015 694581, project PULSE). X.G. is also supported by Hong Kong Research Grants Council (RGC) Early Career Scheme grant 22203525. A.T.P. and J.A.N. are funded by the National Institute for Health and Care Research (NIHR) Oxford Biomedical Research Centre (BRC). We also thank T. Han for valuable comments.
Publisher Copyright:
© Crown 2026.
PY - 2026/1/15
Y1 - 2026/1/15
N2 - Freehand fetal ultrasound examinations require substantial clinical skill. Here we propose Sonomate (mate of a sonographer), an AI assistant that supports users during fetal ultrasound examinations. Sonomate aligns video features with text features derived from transcribed audio to facilitate real-time interaction between an ultrasound machine and a user. Our approach combines coarse-grained video–text alignment with fine-grained image–sentence alignment to build a robust visually grounded language model capable of understanding fetal ultrasound videos. To tackle the challenges posed by heterogeneous language and asynchronous content in real-world video–audio pairs, we design anatomy-aware alignment and context-label correction within the fine-grained alignment stage. Sonomate is effective at anatomy detection in fetal ultrasound images without the need for retraining on manually annotated data. Furthermore, Sonomate shows promising performance in visual question answering for both fetal ultrasound images and videos. Guardrails are built in to ensure the safety of Sonomate during deployment. This advancement paves the way for AI-assistive technology to support sonography training and enhance diagnostic capabilities.
AB - Freehand fetal ultrasound examinations require substantial clinical skill. Here we propose Sonomate (mate of a sonographer), an AI assistant that supports users during fetal ultrasound examinations. Sonomate aligns video features with text features derived from transcribed audio to facilitate real-time interaction between an ultrasound machine and a user. Our approach combines coarse-grained video–text alignment with fine-grained image–sentence alignment to build a robust visually grounded language model capable of understanding fetal ultrasound videos. To tackle the challenges posed by heterogeneous language and asynchronous content in real-world video–audio pairs, we design anatomy-aware alignment and context-label correction within the fine-grained alignment stage. Sonomate is effective at anatomy detection in fetal ultrasound images without the need for retraining on manually annotated data. Furthermore, Sonomate shows promising performance in visual question answering for both fetal ultrasound images and videos. Guardrails are built in to ensure the safety of Sonomate during deployment. This advancement paves the way for AI-assistive technology to support sonography training and enhance diagnostic capabilities.
UR - https://www.scopus.com/pages/publications/105027662503
U2 - 10.1038/s41551-025-01578-3
DO - 10.1038/s41551-025-01578-3
M3 - Journal article
SN - 2157-846X
JO - Nature Biomedical Engineering
JF - Nature Biomedical Engineering
ER -