Data fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only the marginal distributions of strict subsets of those variables have been observed. Because some pairs of variables are never observed together, estimating the covariance matrix of the full set of variables is challenging. Factor analysis models posit lower-dimensional latent variables in the data-generating process, which introduces low-rank components in the complete-data matrix and the population covariance matrix. This low-rank structure can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove that the factor analysis model is identifiable in the statistical file-matching problem under conditions on the number of factors and the number of variables shared across the observed marginal subsets. We also provide an EM algorithm for parameter estimation. On several real datasets, the factor model gives smaller reconstruction errors in file-matching problems than common approaches to low-rank matrix completion.
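To make the low-rank idea concrete, the following is a minimal population-level sketch, not the estimation procedure of the paper. It assumes a factor model with covariance Sigma = Lambda Lambda^T + Psi (Psi diagonal) and a file-matching pattern in which one file observes (X, Y) and the other observes (X, Z), so the block Sigma_YZ is never observed. All dimensions, variable names, and the random seed are hypothetical. The sketch also assumes that the uniquenesses of the shared variables X are known, which in practice would have to be estimated (for example by an EM algorithm). Under these assumptions, and provided the loadings of X have full column rank, the missing block satisfies Sigma_YZ = Sigma_YX (Lambda_X Lambda_X^T)^+ Sigma_XZ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: X is shared by both files, Y and Z are file-specific,
# and q is the number of latent factors.
p_x, p_y, p_z, q = 4, 3, 3, 2
p = p_x + p_y + p_z

# Factor model covariance: Sigma = Lambda Lambda^T + Psi, with Psi diagonal.
Lambda = rng.normal(size=(p, q))
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))
Sigma = Lambda @ Lambda.T + Psi

ix = slice(0, p_x)
iy = slice(p_x, p_x + p_y)
iz = slice(p_x + p_y, p)

# Blocks that the two files can inform: file A sees (X, Y), file B sees (X, Z).
S_xx = Sigma[ix, ix]
S_xy = Sigma[ix, iy]
S_xz = Sigma[ix, iz]

# Assumption for this sketch: the uniquenesses of X are known, so the
# low-rank part of the shared block, Lambda_X Lambda_X^T, is available.
L_xx = S_xx - Psi[ix, ix]

# Complete the unobserved block via the low-rank structure.
# An explicit cutoff keeps the pseudoinverse restricted to the rank-q part.
S_yz_hat = S_xy.T @ np.linalg.pinv(L_xx, rcond=1e-10) @ S_xz

# Compare with the true (never observed) block.
S_yz_true = Sigma[iy, iz]
max_abs_error = float(np.max(np.abs(S_yz_hat - S_yz_true)))
print("max abs error in completed block:", max_abs_error)
```

The recovery here relies on the shared block carrying enough information about the factors (p_x at least q, with full-rank loadings), which is in the spirit of the conditions on the number of factors and shared variables mentioned above.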