TY - JOUR
T1 - A multi-modal deep language model for contaminant removal from metagenome-assembled genomes
AU - Zou, Bohao
AU - Wang, Jingjing
AU - Ding, Yi
AU - Zhang, Zhenmiao
AU - Huang, Yufen
AU - Fang, Xiaodong
AU - Cheung, Ka Chun
AU - See, Simon
AU - Zhang, Lu
N1 - The design of the study and the collection, analysis and interpretation of the data were partially supported by the Young Collaborative Research grant (no. C2004-23Y), HMRF (grant no. 11221026), the open project of BGI-Shenzhen, Shenzhen 518000, China (grant no. BGIRSZ20220014) and HKBU Start-up Grant Tier 2 (grant no. RC-SGT2/19-20/SCI/007). We also thank the BGI Research-Shenzhen, the Research Committee of Hong Kong Baptist University, and the Interdisciplinary Research Clusters Matching Scheme for their kind support of this project.
Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature Limited 2024.
PY - 2024/10
Y1 - 2024/10
AB - Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
UR - http://www.scopus.com/inward/record.url?scp=85205910761&partnerID=8YFLogxK
U2 - 10.1038/s42256-024-00908-5
DO - 10.1038/s42256-024-00908-5
M3 - Journal article
SN - 2522-5839
VL - 6
SP - 1245
EP - 1255
JO - Nature Machine Intelligence
JF - Nature Machine Intelligence
IS - 10
ER -