TY - JOUR
T1 - Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity
AU - Zhang, Zhenmiao
AU - Xiao, Jin
AU - Wang, Hongbo
AU - Yang, Chao
AU - Huang, Yufen
AU - Yue, Zhen
AU - Chen, Yang
AU - Han, Lijuan
AU - Yin, Kejing
AU - Lyu, Aiping
AU - Fang, Xiaodong
AU - Zhang, Lu
N1 - The design of the study and the collection, analysis, and interpretation of the data were partially supported by the Young Collaborative Research Grant (C2004-23Y, L.Z.), HMRF (11221026, L.Z.), the Science Technology and Innovation Committee of Shenzhen Municipality, China (SGDX20190919142801722, XD.F.), the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220012, L.Z. and BGIRSZ20220014, K.J.Y.), the Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419, L.Z.), HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007, L.Z.), HKBU IRCMS (No. IRCMS/19-20/D02, L.Z.).
Publisher copyright:
© 2024. The Author(s).
PY - 2024/5/31
Y1 - 2024/5/31
N2 - Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
UR - https://doi.org/10.1038/s41467-024-49060-z
UR - http://www.scopus.com/inward/record.url?scp=85194998293&partnerID=8YFLogxK
U2 - 10.1038/s41467-024-49060-z
DO - 10.1038/s41467-024-49060-z
M3 - Journal article
C2 - 38821971
SN - 2041-1723
VL - 15
JO - Nature Communications
JF - Nature Communications
IS - 1
M1 - 4631
ER -