Statistical Modeling of Empirical Distribution Transforms via Wasserstein Metrics

Project: Research project

Project Details


Statistics thrives on the increasing demand to model complex data collected in various domains, such as the social sciences, public health, and computer vision. Empirical distributions appear throughout data analysis because of their direct connection with data matrices: the collection of design points in a dataset is an empirical distribution, image data can be represented as empirical distributions, and even neural network parameters can be modeled as empirical distributions. Intentionally or not, statisticians and practitioners keep creating, modeling, and using empirical distributions; it is therefore vital to study the statistical modeling of empirical distribution transforms (EDTs) in order to better understand and use them in real-world applications.

However, the complex structure of the space of distributions impedes the study of EDTs. For example, most deterministic design configuration methods proposed for clinical experiments lack population risk analysis because of the non-IID nature of the resulting empirical distributions. Meanwhile, despite the popularity of distributional techniques in statistical models, their potential is still far from fully realized in deep learning applications.

We are thus motivated to devote this project to exploring various aspects of EDTs, with the hope of making tangible and influential progress both in theory and in practice.

• In the first task, we establish a regression model in Wasserstein space (regressing response distributions on predictor distributions) for multivariate distributions by adapting nonparametric local regression, which places no restriction on the dimension of the distributions. This extends previous univariate distributional regression and allows the regression model to address more complex data, such as images.
• In the second task, we focus on population risk analysis for non-IID designs generated from EDTs. Even though the non-IID designs converge weakly to the underlying distribution, population risk, as opposed to in-sample risk, is rarely studied in the literature because of the technical challenges caused by the non-IID structure. We will leverage Wasserstein metrics to upper bound the generalization error for non-IID designs, and additionally apply the technique to analyze a special yet useful transform: dataset compression.
• In the last task, we propose a new EDT to compress Mixture-of-Experts (MoE) transformers, a powerful language model architecture adopted by Google and OpenAI. It would be especially encouraging if the EDT framework can handle large language models, given their broad usage. We will further collaborate with Microsoft researchers to develop hardware-aware implementations of the proposed distributional algorithm.
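
To make the first task concrete, the following is a minimal sketch of local (Nadaraya-Watson-style) regression between distributions, not the project's method. It assumes the simplest possible setting: univariate empirical distributions with equal sample sizes, where the 2-Wasserstein distance reduces to a comparison of sorted samples and the weighted Wasserstein barycenter is a weighted average of sorted samples. The function names, the Gaussian kernel, and the bandwidth parameter are all illustrative assumptions; the multivariate case targeted by the project requires a general optimal transport solver.

```python
import math


def w2_empirical_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D empirical distributions.

    In one dimension the optimal coupling matches sorted samples, so the
    distance is the root mean squared difference of the order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(xs), sorted(ys)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


def local_wasserstein_regression(x_query, predictors, responses, bandwidth):
    """Kernel-weighted barycenter of response distributions (hypothetical helper).

    Weights come from a Gaussian kernel applied to the W2 distance between
    the query distribution and each predictor distribution. The output is the
    sorted sample of the weighted 1-D Wasserstein barycenter of the responses.
    All response samples are assumed to have the same size.
    """
    weights = [
        math.exp(-((w2_empirical_1d(x_query, p) / bandwidth) ** 2))
        for p in predictors
    ]
    total = sum(weights)
    if total == 0.0:
        # All kernel weights underflowed; the bandwidth is too small.
        raise ValueError("bandwidth too small for the given query")
    sorted_responses = [sorted(r) for r in responses]
    n = len(sorted_responses[0])
    # In 1-D the barycenter's quantile function is the weighted average of
    # the individual quantile functions, i.e. of the sorted samples.
    return [
        sum(w * r[i] for w, r in zip(weights, sorted_responses)) / total
        for i in range(n)
    ]


# Toy data: two predictor/response pairs, far apart in Wasserstein distance.
predictors = [[0.0, 1.0], [10.0, 11.0]]
responses = [[0.0, 1.0], [100.0, 101.0]]
estimate = local_wasserstein_regression([0.0, 1.0], predictors, responses, 1.0)
```

With these toy inputs the query coincides with the first predictor distribution, so the estimate is dominated by the first response distribution. A larger bandwidth would blend the two responses more evenly.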

Taken together, we believe these pieces can inspire substantial progress towards a more principled understanding and modeling of empirical distribution transforms.

Keywords: Wasserstein Distance, Distributional Regression, Non-IID Design Risk Analysis, Model Compression.
Status: Not started
Effective start/end date: 1/01/25 to 31/12/27

