TY - JOUR
T1 - PuzzleCluster: A Novel Unsupervised Clustering Algorithm for Binning DNA Fragments in Metagenomics
AU - Siege, Kyler
AU - Altenburger, Kristen
AU - Hon, Yu-Sing
AU - Lin, Jessey
AU - Yu, Chenglong
N1 - Funding Information:
The authors are very grateful for the opportunity to conduct this research project as part of the Research in Industrial Projects for Students (RIPS) program hosted by the Institute for Pure and Applied Mathematics (IPAM). RIPS is supported by grant DMS-0931852 of the National Science Foundation. The last author also thanks Prof. Stephen Yau with Tsinghua University for his valuable discussion on this paper.
Publisher copyright:
© 2015 Bentham Science Publishers
PY - 2015
Y1 - 2015
N2 - Metagenomic datasets are composed of DNA fragments from large numbers of different and potentially novel organisms. These datasets can contain up to several million sequences taken from heterogeneous populations of extremely varied abundance. Unlike traditional genomic studies, metagenomic analysis requires an additional binning step. This process groups DNA fragments from the same or similar species of origin. However, existing unsupervised metagenomic binning programs cannot accurately analyze datasets containing a large number of species or with significantly unbalanced abundance ratios. To improve upon these current limitations, we present PuzzleCluster, a novel unsupervised binning algorithm. PuzzleCluster incorporates a unique cluster refinement step by automatically grouping reads which share a nucleotide word (i.e. reverse complement pairs) of a predetermined length. Additionally, the clustering parameters are estimated by fitting the Jensen-Shannon distance among sequences using the expectation maximization algorithm. Since clustering parameters are computed based on each dataset, our approach can adapt to the peculiarities of each dataset and is not confined by universal parameters. Furthermore, PuzzleCluster utilizes no prior assumptions about the genetic makeup or number of organisms present in the sample, making it well-suited for applications with a large amount of biodiversity and completely unknown organisms. As a comparison, PuzzleCluster has an accuracy 9%, 19.8%, 15.7%, and 19.5% higher than MetaCluster 3.0 for taxonomic levels phylum, class, order, and family, respectively. PuzzleCluster source code is freely available at http://math.stanford.edu/~ksiegel/PuzzleCluster.html.
AB - Metagenomic datasets are composed of DNA fragments from large numbers of different and potentially novel organisms. These datasets can contain up to several million sequences taken from heterogeneous populations of extremely varied abundance. Unlike traditional genomic studies, metagenomic analysis requires an additional binning step. This process groups DNA fragments from the same or similar species of origin. However, existing unsupervised metagenomic binning programs cannot accurately analyze datasets containing a large number of species or with significantly unbalanced abundance ratios. To improve upon these current limitations, we present PuzzleCluster, a novel unsupervised binning algorithm. PuzzleCluster incorporates a unique cluster refinement step by automatically grouping reads which share a nucleotide word (i.e. reverse complement pairs) of a predetermined length. Additionally, the clustering parameters are estimated by fitting the Jensen-Shannon distance among sequences using the expectation maximization algorithm. Since clustering parameters are computed based on each dataset, our approach can adapt to the peculiarities of each dataset and is not confined by universal parameters. Furthermore, PuzzleCluster utilizes no prior assumptions about the genetic makeup or number of organisms present in the sample, making it well-suited for applications with a large amount of biodiversity and completely unknown organisms. As a comparison, PuzzleCluster has an accuracy 9%, 19.8%, 15.7%, and 19.5% higher than MetaCluster 3.0 for taxonomic levels phylum, class, order, and family, respectively. PuzzleCluster source code is freely available at http://math.stanford.edu/~ksiegel/PuzzleCluster.html.
KW - Clustering
KW - expectation maximization
KW - Jensen-Shannon distance
KW - metagenome
KW - quality threshold algorithm
KW - word agreement
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-84989221532&partnerID=MN8TOARS
M3 - Journal article
SN - 1574-8936
VL - 10
SP - 225
EP - 231
JO - Current Bioinformatics
JF - Current Bioinformatics
IS - 2
ER -