We propose a novel data-driven semi-confirmatory factor analysis (SCFA) model that addresses the absence of model specification and handles the estimation and inference tasks with high-dimensional data. Confirmatory factor analysis (CFA) is a prevalent and pivotal technique for statistically validating the covariance structure of latent common factors derived from multiple observed variables. In contrast to other factor analysis methods, CFA offers a flexible covariance modeling approach for common factors, enhancing the interpretability of relationships between the common factors, as well as between common factors and observations. However, the application of classic CFA models faces dual barriers: the lack of a prerequisite specification of "non-zero loadings" or factor membership (i.e., categorizing the observations into distinct common factors), and the formidable computational burden in high-dimensional scenarios where the number of observed variables surpasses the sample size. To bridge these two gaps, we propose the SCFA model by integrating the underlying high-dimensional covariance structure of observed variables into the CFA model. Additionally, we offer computationally efficient solutions (i.e., closed-form uniformly minimum variance unbiased estimators) and ensure accurate statistical inference through closed-form exact variance estimators for all model parameters and factor scores. Through an extensive simulation analysis benchmarking against standard computational packages, SCFA exhibits superior performance in estimating model parameters and recovering factor scores, while substantially reducing the computational load, across both low- and high-dimensional scenarios. It exhibits moderate robustness to model misspecification. We illustrate the practical application of the SCFA model by conducting factor analysis on a high-dimensional gene expression dataset.