## Project Details

### Description

With the development of domain scientific fields, available data sets grow rapidly in size and scope in biomedical, engineering and social behavioral sciences. This new trend has created a challenge to data analysis in trading off between the statistical inference accuracy and computational complexity. When dealing with large scale datasets, many established statistical methods are considered as impractical due to their theoretical computational complexity when ignoring the potential advantages in comput- ing structure and programming techniques from computer science’s perspectives. On the other hand, for classical statistical modeling, especially those complex semiparametric regression or nonparametric regression modeling problems, research literature has been focusing on how to make trade-off between the bias and variance of the model estimation, and finding optimal efficient estimates or inferences for those complex models. In order to obtain optimal efficient estimates or inferences, model building has to base on many restrictive mathematical conditions and complicated estimating or inference procedures. Such set up often ignores the price of the complexity and stability of computation. Hence it narrows the application ranges of those so-called optimal efficient statistical modeling methods in practice. To make good trade-off between statistical inference and computation, as shown by Chanrasekaran and Jordan (2013), we need a novel foundational perspective that blend statistics and computer science. This pro- posal is motivated by this new view, and it aims to bring in new insights for the trade off between the statistical inference and computational complexity.

The computational structure of the computer is based on the 0-1 Boolean operation. The speed of 0-1 Boolean operation is much faster than other computing operations, such as the floating point computing. The first of goal of this proposal is to investigate how to apply 0-1 Boolean operation, the basic and simplest computer computing operation in the algorithms of statistical modeling to improve the computing speed. As a result, many statistical modeling tasks, which can not, or are difficult to be done by normal computer resources from theoretical insights, can now be accomplished in practice. In most situations, observations are continuous. 0-1 Boolean operation can not be easily applied to perform direct computation. However, if we discretize the data, 0-1 Boolean operation can be applied to speed statistical modeling procedure. In this proposal, we consider specifically how to use discretization methods for the homogeneous problem of categorical data modeling, and sure independence screening methods, especially the screening method of the pure interaction effects for ultra-high dimensional data. After discretization, our algorithm and the proposed software package based on the bit operation can in practice accomplish full interaction effects screening tasks for most of real ultra-high dimensional data.

In many complex classical semiparametric or nonparametric regression models where the parametric component is of the central interest, the nonparametric components are often used to reduce model bias. In such models, efficient estimation and inference for the parametric components is an important task in modeling. Following the idea of partial consistency (Neyman and Scott 1943), we treat the parametric components as the structure parameter components in the model, and the nonparametric components in the model as incidental parameter components. Step functions or other discretization methods will be use to approximate the nonparametric components unbiasedly. This way we change the models into high dimensional parametric models. The efficient algorithm can be used to estimate them. The efficiency and asymptotic normality of those parametric components will be investigated. Inference methods based on such estimates will also be studied. Furthermore, we also show that the nonparametric components can be also estimated nearly optimal efficiently by two step estimating procedures which are based on the almost unbiased, but inconsistent nonparametric component estimation in the first step.

The novelty of this research is the development of a statistically efficient and computationally feasible framework based on two novel foundational perspectives. One is that by discretization, the advantages of the computer structure can be fully utilized. The second is that by using the partial consistent phenomenon, the computation complexity of statistical modeling can be greatly reduced, while the efficiency of the model estimates is almost preserved.

The computational structure of the computer is based on the 0-1 Boolean operation. The speed of 0-1 Boolean operation is much faster than other computing operations, such as the floating point computing. The first of goal of this proposal is to investigate how to apply 0-1 Boolean operation, the basic and simplest computer computing operation in the algorithms of statistical modeling to improve the computing speed. As a result, many statistical modeling tasks, which can not, or are difficult to be done by normal computer resources from theoretical insights, can now be accomplished in practice. In most situations, observations are continuous. 0-1 Boolean operation can not be easily applied to perform direct computation. However, if we discretize the data, 0-1 Boolean operation can be applied to speed statistical modeling procedure. In this proposal, we consider specifically how to use discretization methods for the homogeneous problem of categorical data modeling, and sure independence screening methods, especially the screening method of the pure interaction effects for ultra-high dimensional data. After discretization, our algorithm and the proposed software package based on the bit operation can in practice accomplish full interaction effects screening tasks for most of real ultra-high dimensional data.

In many complex classical semiparametric or nonparametric regression models where the parametric component is of the central interest, the nonparametric components are often used to reduce model bias. In such models, efficient estimation and inference for the parametric components is an important task in modeling. Following the idea of partial consistency (Neyman and Scott 1943), we treat the parametric components as the structure parameter components in the model, and the nonparametric components in the model as incidental parameter components. Step functions or other discretization methods will be use to approximate the nonparametric components unbiasedly. This way we change the models into high dimensional parametric models. The efficient algorithm can be used to estimate them. The efficiency and asymptotic normality of those parametric components will be investigated. Inference methods based on such estimates will also be studied. Furthermore, we also show that the nonparametric components can be also estimated nearly optimal efficiently by two step estimating procedures which are based on the almost unbiased, but inconsistent nonparametric component estimation in the first step.

The novelty of this research is the development of a statistically efficient and computationally feasible framework based on two novel foundational perspectives. One is that by discretization, the advantages of the computer structure can be fully utilized. The second is that by using the partial consistent phenomenon, the computation complexity of statistical modeling can be greatly reduced, while the efficiency of the model estimates is almost preserved.

Status | Finished |
---|---|

Effective start/end date | 1/01/19 → 30/06/22 |

## Fingerprint

Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.