TY - JOUR
T1 - A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants
AU - Wang, Chonghao
AU - Zhang, Jing
AU - Veldsman, Werner Pieter
AU - Zhou, Xin
AU - Zhang, Lu
N1 - Funding information:
L.Z. is supported by a Research Grant Council Early Career Scheme (HKBU 22201419), a Guangdong-Hong Kong Technology Cooperation Funding Scheme (GHX/133/20SZ), an IRCMS HKBU (No. IRCMS/19-20/D02), a HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007), a grant from the Guangdong Basic and Applied Basic Research Foundation (No. 2021A1515012226), and a grant from Shenzhen Science and Technology Innovation Commission (SZSTI) - Shenzhen Virtual University Park (SZVUP) Special Fund Project (No. 2021Szvup135).
Publisher copyright:
© The Author(s) 2022. Published by Oxford University Press. All rights reserved.
PY - 2023/1
Y1 - 2023/1
N2 - Quantifying an individual’s risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have benchmarked PRS calculation tools and assessed their potential to guide future clinical applications, some gaps remain: existing benchmarks lack (i) simulated data covering different genetic effects; (ii) evaluation of machine learning models and (iii) evaluation in multi-ancestry studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods performed better on simulated data from additive models, whereas machine learning models had an edge on data that include genetic interactions. Ensemble models, which integrate various statistical methods, were generally the best choice. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also found that disease heritability strongly affected the predictive performance of all methods. Both the number and the effect sizes of risk SNPs were important, and sample size strongly influenced the performance of all methods. In the trans-ancestry studies, the performance of most methods worsened when training and testing sets came from different populations.
AB - Quantifying an individual’s risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have benchmarked PRS calculation tools and assessed their potential to guide future clinical applications, some gaps remain: existing benchmarks lack (i) simulated data covering different genetic effects; (ii) evaluation of machine learning models and (iii) evaluation in multi-ancestry studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods performed better on simulated data from additive models, whereas machine learning models had an edge on data that include genetic interactions. Ensemble models, which integrate various statistical methods, were generally the best choice. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also found that disease heritability strongly affected the predictive performance of all methods. Both the number and the effect sizes of risk SNPs were important, and sample size strongly influenced the performance of all methods. In the trans-ancestry studies, the performance of most methods worsened when training and testing sets came from different populations.
KW - common diseases
KW - disease heritability
KW - genomic variants
KW - polygenic risk scores
KW - sample size
KW - SNP effect size
UR - http://www.scopus.com/inward/record.url?scp=85147044597&partnerID=8YFLogxK
U2 - 10.1093/bib/bbac552
DO - 10.1093/bib/bbac552
M3 - Journal article
SN - 1467-5463
VL - 24
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
IS - 1
M1 - bbac552
ER -