TY - JOUR
T1 - The iterated score regression estimation algorithm for PCA-based missing data with high correlation
AU - Guo, Guangbao
AU - Song, Haoyue
AU - Zhu, Lixing
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/3/17
Y1 - 2025/3/17
AB - To handle principal component analysis (PCA)-based missing data with high correlation, we propose a novel imputation algorithm called iterated score regression. The procedure first introduces a transformation matrix that separates missing values and observed values into two data blocks, and then uses the data blocks, the score matrix, and the PCA model to construct the related regression equations. The update of the estimates at each iteration is highlighted. We examine the sensitivity of the proposed algorithm, including the effects of standard deviations, correlation coefficients, missing proportions, variable numbers, and sample sizes, over different intervals of the standard deviations and correlation coefficients. For comparison with existing algorithms, we suggest modifications of three popular algorithms that are also used to deal with missing data, though not with highly correlated data. In the numerical studies we conducted, the MSE values of the proposed algorithm are always the smallest among the competitors we consider, which shows its stability and accuracy. The algorithm also shows its advantage on three real data sets with missing values, used as illustrations.
KW - High correlation
KW - Iterated score regression
KW - Missing data
KW - Principal component analysis
KW - Sensitivity
UR - http://www.scopus.com/inward/record.url?scp=105000108247&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-93333-6
DO - 10.1038/s41598-025-93333-6
M3 - Journal article
C2 - 40097527
AN - SCOPUS:105000108247
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 9067
ER -