TY - JOUR
T1 - SNP selection and classification of genome-wide SNP data using stratified sampling random forests
AU - Wu, Qingyao
AU - Ye, Yunming
AU - Liu, Yang
AU - Ng, Kwok Po
N1 - Funding Information:
Manuscript received July 03, 2012; accepted August 01, 2012. Date of current version September 10, 2012. The work of Y. Ye was supported in part by NSFC under Grant 61073195, and Shenzhen Science and Technology Program under Grant CXB201005250024A. The work of M. K. Ng was supported in part by the Centre for Mathematical Imaging and Vision, HKRGC under Grant 201812 and HKBU FRG Grant. Asterisk indicates corresponding author.
PY - 2012/9
Y1 - 2012/9
AB - In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. When a random forest chooses feature subspaces by simple random sampling with the default mtry parameter, too many of the selected subspaces contain no informative SNPs. An exhaustive search for an optimal mtry is often required to include useful, relevant SNPs and to exclude the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme for informativeness that divides the SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace from which a decision tree is generated. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We use two genome-wide SNP data sets (Parkinson case-control data comprising 408803 SNPs and Alzheimer case-control data comprising 380157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
KW - Genome-wide association study
KW - random forest
KW - SNP
KW - stratified sampling
UR - http://www.scopus.com/inward/record.url?scp=84866484354&partnerID=8YFLogxK
U2 - 10.1109/TNB.2012.2214232
DO - 10.1109/TNB.2012.2214232
M3 - Journal article
C2 - 22987127
AN - SCOPUS:84866484354
SN - 1536-1241
VL - 11
SP - 216
EP - 227
JO - IEEE Transactions on Nanobioscience
JF - IEEE Transactions on Nanobioscience
IS - 3
M1 - 6298047
ER -