A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants

Chonghao Wang, Jing Zhang, Werner Pieter Veldsman, Xin Zhou, Lu Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review


Quantifying an individual’s risk for common diseases is an important goal of precision health. The polygenic risk score (PRS), which aggregates multiple risk alleles of candidate diseases, has emerged as a standard approach for identifying high-risk individuals. Although several studies have benchmarked PRS calculation tools and assessed their potential to guide future clinical applications, some gaps remain: (i) a lack of varied simulated data with different genetic effects; (ii) no evaluation of machine learning models; and (iii) no evaluation in multi-ancestry studies. In this study, we systematically validated and compared 13 statistical methods, 5 machine learning models and 2 ensemble models using simulated data with additive and genetic interaction models, 22 common diseases with internal training sets, 4 common diseases with external summary statistics and 3 common diseases for trans-ancestry studies in UK Biobank. The statistical methods performed better on simulated data from additive models, whereas machine learning models had an edge on data that include genetic interactions. Ensemble models, which integrate various statistical methods, were generally the best choice. LDpred2 outperformed the other standalone tools, whereas PRS-CS, lassosum and DBSLMM showed comparable performance. We also found that disease heritability strongly affected the predictive performance of all methods. Both the number and the effect sizes of risk SNPs are important, and sample size strongly influences the performance of all methods. For the trans-ancestry studies, we found that the performance of most methods worsened when the training and testing sets came from different populations.
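As a minimal illustration of what "aggregating risk alleles" means, the sketch below assumes the common additive form of a PRS: a weighted sum of per-SNP risk-allele dosages (0, 1 or 2 copies), with one effect-size weight per SNP. The function name, the weights and the genotypes are hypothetical values chosen for illustration; they are not taken from the study, and the benchmarked tools estimate the weights in far more sophisticated ways.

```python
from typing import Dict, List, Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele dosages (additive model assumed)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-SNP effect sizes (illustrative only, not from the study).
weights: List[float] = [0.5, -0.25, 1.0]

# Hypothetical risk-allele counts at the same three SNPs for two individuals.
individuals: Dict[str, List[int]] = {
    "person_a": [0, 1, 2],
    "person_b": [2, 2, 0],
}

# One score per individual; a higher score means higher predicted risk.
scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}
print(scores)
```

Under this additive assumption each SNP contributes independently, which is consistent with the abstract's observation that models able to capture genetic interactions can do better when such interactions are present.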
Original language: English
Article number: bbac552
Number of pages: 15
Journal: Briefings in Bioinformatics
Issue number: 1
Publication status: Published - Jan 2023

Scopus Subject Areas

  • Information Systems
  • Molecular Biology

User-Defined Keywords

  • common diseases
  • disease heritability
  • genomic variants
  • polygenic risk scores
  • sample size
  • SNP effect size

