TY - JOUR
T1 - Causal Transformer for Learning Embeddings from Structured Medical History Records and Multi-Source Data Integration for Complex Disease Risk Prediction
AU - Li, Zeming
AU - Xu, Yu
AU - Chowdhury, Debajyoti
AU - Yip, Hip Fung
AU - Wang, Chonghao
AU - Zhang, Lu
N1 - Open access funding provided by Hong Kong Baptist University Library. L.Z. was supported by a Young Collaborative Research grant (No. C2004-23Y), HMRF grant (No. 11221026) and Guangdong-Hong Kong Technology Cooperation Funding Scheme (No. GHX/133/20SZ).
Publisher Copyright:
© The Author(s) 2025.
PY - 2025/9/17
Y1 - 2025/9/17
N2 - Traditional disease risk prediction models predominantly rely on statistical algorithms and often focus on genetic factors or a limited set of lifestyle factors to estimate the risk of disease onset. Recently, more comprehensive approaches have emerged that integrate genetic factors with additional lifestyle factors (e.g., alcohol intake) and physical features (e.g., body mass index, age) to increase predictive accuracy. Since the onset of complex diseases is often accompanied by the occurrence of comorbidities, incorporating medical history records is a critical yet underexplored avenue for improving risk prediction. In this study, we propose a novel framework, MIDRP (Multi-source Integration for Disease Risk Prediction), which incorporates genetic variants, lifestyle factors, physical attributes, and medical history records to achieve more robust and accurate predictions. At the heart of our approach lies a causal Transformer architecture, specifically designed to extract and interpret nuanced patterns from medical history records. In the experiments, we compared MIDRP with several baselines, including LDPred2, random forest, multilayer perceptron, logistic regression, AdaBoost, DiseaseCapsule, EIR, and Med-BERT, on three complex diseases (Coronary Artery Disease, Type 2 Diabetes, and Breast Cancer) using data from the UK Biobank. Our method achieved state-of-the-art performance, with AUROC scores of 0.783, 0.841, and 0.784, respectively, demonstrating its potential in the field of complex disease risk prediction.
AB - Traditional disease risk prediction models predominantly rely on statistical algorithms and often focus on genetic factors or a limited set of lifestyle factors to estimate the risk of disease onset. Recently, more comprehensive approaches have emerged that integrate genetic factors with additional lifestyle factors (e.g., alcohol intake) and physical features (e.g., body mass index, age) to increase predictive accuracy. Since the onset of complex diseases is often accompanied by the occurrence of comorbidities, incorporating medical history records is a critical yet underexplored avenue for improving risk prediction. In this study, we propose a novel framework, MIDRP (Multi-source Integration for Disease Risk Prediction), which incorporates genetic variants, lifestyle factors, physical attributes, and medical history records to achieve more robust and accurate predictions. At the heart of our approach lies a causal Transformer architecture, specifically designed to extract and interpret nuanced patterns from medical history records. In the experiments, we compared MIDRP with several baselines, including LDPred2, random forest, multilayer perceptron, logistic regression, AdaBoost, DiseaseCapsule, EIR, and Med-BERT, on three complex diseases (Coronary Artery Disease, Type 2 Diabetes, and Breast Cancer) using data from the UK Biobank. Our method achieved state-of-the-art performance, with AUROC scores of 0.783, 0.841, and 0.784, respectively, demonstrating its potential in the field of complex disease risk prediction.
KW - Deep learning
KW - Genome wide association study
KW - Medical history record
KW - Polygenic risk score
KW - Single nucleotide polymorphism
UR - https://www.scopus.com/pages/publications/105016704894
U2 - 10.1007/s12539-025-00749-9
DO - 10.1007/s12539-025-00749-9
M3 - Journal article
AN - SCOPUS:105016704894
SN - 1913-2751
JO - Interdisciplinary Sciences - Computational Life Sciences
JF - Interdisciplinary Sciences - Computational Life Sciences
ER -