Nonparametric variable selection and semiparametric structure identification in high dimension

Project: Research project

Project Details


High-dimensional data are increasingly common in biology, finance, genetics, medicine and other fields, thanks to rapid advances in data collection, storage and preprocessing techniques. There is a fast-growing literature on semiparametric and nonparametric statistical modeling of such data, and the conclusions and inference drawn from these models play important roles in the subject areas. One example is a recent dataset concerning environmental risks and genetic factors, and their interactions, in adult asthma. We focus on nonparametric variable selection and semiparametric structure identification when the goal is to model the relationship between a response variable and a large number of candidate covariates. The unknown regression function for such high-dimensional data is often complex. Semiparametric regression models are parsimonious, flexible, interpretable and efficient. Feature selection and structure identification are crucial steps in semiparametric modeling, and they become challenging when the dimensionality is large. The existing literature typically specifies a particular semiparametric regression model, such as an additive model, a varying coefficient model or a semivarying coefficient model, assumes that it is sparse, and then builds variable selection and structure identification procedures to arrive at a final low-dimensional semiparametric regression model. However, such an approach may not be feasible in practice because the true structure is complex and unknown; the asthma data are an example. On the other hand, variable selection under a parametric model may fail because the parametric form is misspecified. Moreover, variable selection under a fully nonparametric regression model is a notoriously difficult problem, and available methods have been shown to scale only to dimensions that grow logarithmically in the sample size.
In this research project, we investigate nonparametric variable selection using a functional analysis of variance model with only main effects and two-way interactions as a working model, and develop nonparametric variable selection and semiparametric structure identification procedures. For these purposes, we introduce new, efficient Gaussian process regression estimators. Theoretical justifications for the proposed methods will be derived. Monte Carlo simulation studies will be conducted to examine their finite-sample performance. The proposed methods will be applied to the asthma dataset and to other datasets arising from econometric, environmental and genetic studies. In particular, the results from the asthma data will shed some light on further research into treatment, and even personalized medicine, for asthma, a phenotypically heterogeneous disease with unclear etiology.
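For readers unfamiliar with the working model, a functional analysis of variance model truncated at two-way interactions is commonly written as below. This is a sketch in assumed notation, not the project's own formulation: the symbols Y, x_j, p, \mu, f_j, f_{jk} and \varepsilon are introduced here for illustration only, and the side conditions shown are one conventional choice for identifiability.

```latex
% Working model (assumed notation): response Y, covariates x = (x_1, ..., x_p)
\[
  Y = f(x) + \varepsilon, \qquad
  f(x) = \mu + \sum_{j=1}^{p} f_j(x_j) + \sum_{1 \le j < k \le p} f_{jk}(x_j, x_k),
\]
% One conventional identifiability condition: each component averages to zero
% in each of its arguments (covariates rescaled to [0, 1] for simplicity)
\[
  \int_0^1 f_j(x_j)\, dx_j = 0, \qquad
  \int_0^1 f_{jk}(x_j, x_k)\, dx_j = \int_0^1 f_{jk}(x_j, x_k)\, dx_k = 0 .
\]
```

Under this reading, covariate x_j is irrelevant when f_j and every f_{jk} involving it vanish identically, which is the variable selection question. Structure identification then asks which of the retained components are absent or take a simpler form; for example, a model with no nonzero f_{jk} reduces to an additive model.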
Status: Not started
Effective start/end date: 1/01/25 to 31/12/27

