TY - JOUR
T1 - Benchmarking multi-platform sequencing technologies for human genome assembly
AU - Wang, Jingjing
AU - Veldsman, Werner Pieter
AU - Fang, Xiaodong
AU - Huang, Yufen
AU - Xie, Xuefeng
AU - Lyu, Aiping
AU - Zhang, Lu
N1 - Funding information:
Open project of BGI-Shenzhen, Shenzhen, 518000, China (BGIRSZ20220014); Guangdong Basic and Applied Basic Research Foundation (No. 2021A1515012226); Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419); HKBU interdisciplinary Research Clusters Matching Scheme (No. IRCMS/19-20/D02); HKBU Start-up Grant Tier2 (RC-SGT2/19-20/SCI/007); Guangdong-Hong Kong Technology Cooperation Funding Scheme (GHX/133/20SZ); Shenzhen Science and Technology Innovation Commission-Shenzhen Virtual University Park (SZVUP) Special Fund Project (No. 2021Szvup135).
Publisher copyright:
© The Author(s) 2023. Published by Oxford University Press.
PY - 2023/9
Y1 - 2023/9
N2 - Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools aim to enhance the quality of haploid and diploid human genome assemblies through contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, even though several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performance in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve quality. We recommend polishing with short-reads, rather than with the long-reads themselves, to improve the base accuracy of contigs from Oxford Nanopore long-reads. Hi-C is the best choice for chromosome-level scaffolding because it captures longer-range DNA connectedness than 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly.
For human diploid genome assembly, hifiasm is the best tool when using PacBio HiFi and Hi-C data. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
AB - Genome assembly is a computational technique that involves piecing together deoxyribonucleic acid (DNA) fragments generated by sequencing technologies to create a comprehensive and precise representation of the entire genome. Generating a high-quality human reference genome is a crucial prerequisite for comprehending human biology, and it is also vital for downstream genomic variation analysis. Many efforts have been made over the past few decades to create a complete and gapless reference genome for humans by using a diverse range of advanced sequencing technologies. Several available tools aim to enhance the quality of haploid and diploid human genome assemblies through contig assembly, polishing of contig errors, scaffolding and variant phasing. Selecting the appropriate tools and technologies remains a daunting task, even though several studies have investigated the pros and cons of different assembly strategies. The goal of this paper was to benchmark various strategies for human genome assembly by combining sequencing technologies and tools on two publicly available samples (NA12878 and NA24385) from Genome in a Bottle. We then compared their performance in terms of continuity, accuracy, completeness, variant calling and phasing. We observed that PacBio HiFi long-reads are the optimal choice for generating an assembly with low base errors. On the other hand, we were able to produce the most continuous contigs with Oxford Nanopore long-reads, but they may require further polishing to improve quality. We recommend polishing with short-reads, rather than with the long-reads themselves, to improve the base accuracy of contigs from Oxford Nanopore long-reads. Hi-C is the best choice for chromosome-level scaffolding because it captures longer-range DNA connectedness than 10× linked-reads and Bionano optical maps. However, a combination of multiple technologies can be used to further improve the quality and completeness of genome assembly.
For human diploid genome assembly, hifiasm is the best tool when using PacBio HiFi and Hi-C data. Looking to the future, we expect that further advancements in human diploid assemblers will leverage the power of PacBio HiFi reads and other technologies with long-range DNA connectedness to enable the generation of high-quality, chromosome-level and haplotype-resolved human genome assemblies.
KW - human genome assembly
KW - multi-platform sequencing
KW - contig polishing
KW - scaffolding
KW - diploid assembly
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85172424727&origin=inward
U2 - 10.1093/bib/bbad300
DO - 10.1093/bib/bbad300
M3 - Journal article
C2 - 37594299
SN - 1467-5463
VL - 24
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
IS - 5
M1 - bbad300
ER -