Microarray data profile gene expression on a genome-wide scale and provide a good way to study associations between gene expression and the occurrence or progression of cancer. Many researchers have recognized that microarray data are useful for predicting cancer cases. However, the dimensionality of gene expression data, which is far larger than the sample size, makes this task very difficult. It is therefore important to identify the significant genes causing cancer. Many feature selection algorithms have been proposed that focus on improving the accuracy of cancer prediction while ignoring the correlations between features. In this work, a novel framework (named SGS) is presented for significant gene selection and efficient cancer case classification. The proposed framework first applies a clustering algorithm to find gene groups in which the genes are highly correlated with one another, then selects the significant genes in each group using the Bayesian Lasso method and the important gene groups using the group Lasso method, and finally builds a prediction model on the reduced gene space with an efficient classification algorithm (such as support vector machine (SVM), 1NN, or regression). Experimental results on publicly available microarray data show that the proposed framework often outperforms existing feature selection and prediction methods such as SAM, information gain (IG), and Lasso-type prediction models.
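The three stages described above (cluster correlated genes, select genes within each group, classify on the reduced gene space) can be illustrated with a minimal Python sketch. It is not the SGS implementation. It runs on synthetic data, assumes NumPy, SciPy, and scikit-learn are available, and makes two substitutions that should be read as assumptions: an ordinary Lasso regression stands in for the Bayesian Lasso, and the group Lasso step is replaced by simply dropping groups in which no gene is selected. The function names `cluster_genes` and `select_genes`, the correlation threshold, and the penalty `alpha` are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.linear_model import Lasso
from sklearn.svm import SVC


def cluster_genes(X, threshold=0.5):
    """Group genes (columns of X) by average-linkage clustering on 1 - |corr|."""
    corr = np.corrcoef(X, rowvar=False)
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=threshold, criterion="distance")


def select_genes(X, y, labels, alpha=0.1):
    """Fit a Lasso per gene group and keep genes with nonzero coefficients.

    Assumption: plain Lasso is a stand-in for the Bayesian Lasso; groups in
    which every coefficient is zero are dropped, in place of group Lasso.
    """
    selected = []
    for group in np.unique(labels):
        idx = np.flatnonzero(labels == group)
        model = Lasso(alpha=alpha).fit(X[:, idx], y)
        selected.extend(idx[np.abs(model.coef_) > 1e-8])
    return np.array(sorted(selected), dtype=int)


# Synthetic data: 80 samples, 40 genes in 4 correlated blocks of 10.
# Only the first block (genes 0-9) carries the class signal.
rng = np.random.default_rng(0)
n_samples, n_blocks, block_size = 80, 4, 10
factors = rng.normal(size=(n_samples, n_blocks))
X = np.repeat(factors, block_size, axis=1)
X = X + 0.3 * rng.normal(size=X.shape)
y = (factors[:, 0] > 0).astype(int)

# Hold out the last 20 samples; clustering and selection use training data only.
X_train, y_train = X[:60], y[:60]
X_test, y_test = X[60:], y[60:]

labels = cluster_genes(X_train)
selected = select_genes(X_train, y_train, labels)

# Stage 3: linear SVM on the reduced gene space.
clf = SVC(kernel="linear").fit(X_train[:, selected], y_train)
accuracy = clf.score(X_test[:, selected], y_test)

print("gene groups found:", len(np.unique(labels)))
print("genes kept:", selected.tolist())
print("held-out accuracy:", accuracy)
```

Clustering before selection is what lets the per-group Lasso choose among correlated genes instead of picking one of them arbitrarily from the whole genome; swapping `SVC` for a 1-nearest-neighbour classifier would change only the last stage.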
Scopus Subject Areas
- Information Systems
- Computer Science Applications
- cancer prediction
- Gene selection