TY - JOUR
T1 - Classifying next-generation sequencing data using a zero-inflated Poisson model
AU - Zhou, Yan
AU - Wan, Xiang
AU - Zhang, Baoxue
AU - Tong, Tiejun
N1 - Funding Information:
Yan Zhou’s research was supported by the National Natural Science Foundation of China (Grant No. 11701385), National Statistical Research Project (Grant No. 2017LY56), the Doctor Start Fund of Guangdong Province (Grant No. 2016A030310062) and the National Social Science Foundation of China (Grant No. 15CTJ008). Xiang Wan’s research was supported by the Hong Kong RGC grant HKBU12202114 and the National Natural Science Foundation of China (Grant No. 61501389). Baoxue Zhang’s research was supported by the National Natural Science Foundation of China (Grant No. 11671268). Tiejun Tong’s research was supported by the Hong Kong Baptist University grants FRG1/16-17/018 and FRG2/16-17/ 074, the Health and Medical Research Fund (Grant No. 04150476) and the National Natural Science Foundation of China (Grant No. 11671338).
PY - 2018/4/15
Y1 - 2018/4/15
N2 - Motivation With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNAs profiling and classification. Identifying which type of diseases a new patient belongs to with RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied for RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (i.e. when the sequence depth is not enough or small RNAs with the length of 18-30 nucleotides). Therefore, it is desired to develop a new model to analyze RNA-seq data with an excess of zeros. Results In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets including a breast cancer RNA-seq dataset and a microRNA-seq dataset are also analyzed, and they coincide with the simulation results that our proposed method outperforms the existing competitors.
AB - Motivation With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNAs profiling and classification. Identifying which type of diseases a new patient belongs to with RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied for RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (i.e. when the sequence depth is not enough or small RNAs with the length of 18-30 nucleotides). Therefore, it is desired to develop a new model to analyze RNA-seq data with an excess of zeros. Results In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets including a breast cancer RNA-seq dataset and a microRNA-seq dataset are also analyzed, and they coincide with the simulation results that our proposed method outperforms the existing competitors.
UR - http://www.scopus.com/inward/record.url?scp=85046802495&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx768
DO - 10.1093/bioinformatics/btx768
M3 - Journal article
C2 - 29186294
AN - SCOPUS:85046802495
SN - 1367-4803
VL - 34
SP - 1329
EP - 1335
JO - Bioinformatics
JF - Bioinformatics
IS - 8
ER -