TY - JOUR
T1 - NBLDA
T2 - Negative binomial linear discriminant analysis for RNA-Seq data
AU - Dong, Kai
AU - Zhao, Hongyu
AU - Tong, Tiejun
AU - Wan, Xiang
N1 - Funding Information: The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. Hongyu Zhao's research was supported by the National Institutes of Health grant R01 GM59507. Xiang Wan's research was supported by the Hong Kong RGC grant HKBU12202114, the Hong Kong Baptist University grant FRG2/14-15/077, and Hong Kong Baptist University Strategic Development Fund. Tiejun Tong's research was supported in part by Hong Kong Baptist University FRG grants FRG1/14-15/084, FRG2/15-16/019 and FRG2/15-16/038, and the National Natural Science Foundation of China grant (No. 11671338).
PY - 2016/9/13
Y1 - 2016/9/13
N2 - Background: RNA-sequencing (RNA-Seq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods that have been developed for microarray data can be applied to RNA-Seq data, they are not ideal due to the discrete nature of RNA-Seq data. The Poisson distribution and negative binomial distribution are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493-2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may not be as appropriate as the negative binomial distribution when biological replicates are available and in the presence of overdispersion (i.e., when the variance is larger than the mean). However, it is more complicated to model negative binomial variables because they involve a dispersion parameter that needs to be estimated. Results: In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. By Bayes' rule, we construct the classifier by fitting a negative binomial model, and propose some plug-in rules to estimate the unknown parameters in the classifier. The relationship between the negative binomial classifier and the Poisson classifier is explored, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show the superiority of our proposed method. We also analyze two real RNA-Seq data sets to demonstrate the advantages of our method in real-world applications. Conclusions: We have developed a new classifier using the negative binomial model for RNA-Seq data classification. Our simulation results show that our proposed classifier performs better than existing methods. The proposed classifier can serve as an effective tool for classifying RNA-Seq data. Based on the comparison results, we have provided some guidelines for scientists to decide which method should be used in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA
AB - Background: RNA-sequencing (RNA-Seq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods that have been developed for microarray data can be applied to RNA-Seq data, they are not ideal due to the discrete nature of RNA-Seq data. The Poisson distribution and negative binomial distribution are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493-2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may not be as appropriate as the negative binomial distribution when biological replicates are available and in the presence of overdispersion (i.e., when the variance is larger than the mean). However, it is more complicated to model negative binomial variables because they involve a dispersion parameter that needs to be estimated. Results: In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. By Bayes' rule, we construct the classifier by fitting a negative binomial model, and propose some plug-in rules to estimate the unknown parameters in the classifier. The relationship between the negative binomial classifier and the Poisson classifier is explored, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show the superiority of our proposed method. We also analyze two real RNA-Seq data sets to demonstrate the advantages of our method in real-world applications. Conclusions: We have developed a new classifier using the negative binomial model for RNA-Seq data classification. Our simulation results show that our proposed classifier performs better than existing methods. The proposed classifier can serve as an effective tool for classifying RNA-Seq data. Based on the comparison results, we have provided some guidelines for scientists to decide which method should be used in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA
KW - Linear discriminant analysis
KW - Negative binomial distribution
KW - RNA-Seq
UR - http://www.scopus.com/inward/record.url?scp=84992159400&partnerID=8YFLogxK
U2 - 10.1186/s12859-016-1208-1
DO - 10.1186/s12859-016-1208-1
M3 - Journal article
C2 - 27623864
AN - SCOPUS:84992159400
SN - 1471-2105
VL - 17
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 369
ER -