Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis and principal component analysis, which have well-established theory and fast algorithms. Generalized linear latent variable models (GLLVMs) extend such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs are computationally intensive and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to such high-volume, high-dimensional datasets. We approximate the likelihood using penalized quasi-likelihood and estimate the model parameters with a Newton method and Fisher scoring. Our method greatly reduces computation time and can be easily parallelized, enabling factorization at unprecedented scale on commodity hardware. We illustrate the application of our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit, finding that most of the variability can be explained with a handful of factors.
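
For orientation, the kind of model and update the abstract refers to can be sketched as follows. This is a generic textbook formulation, not necessarily the exact parameterization used in the article. The symbols $y_{ij}$ (response $j$ in unit $i$), $z_i$ (latent factors of dimension $d$), $\beta_{0j}$ (intercepts), $\Lambda$ (the $p \times d$ loading matrix with rows $\lambda_j^\top$), and the choice of a canonical link with unit dispersion (for example, Poisson counts with a log link) are assumptions made here for illustration.

```latex
% Assumed GLLVM: responses are conditionally independent given the latent factors
\begin{align*}
  g\bigl(\mathbb{E}[y_{ij} \mid z_i]\bigr) &= \eta_{ij} = \beta_{0j} + \lambda_j^\top z_i,
  \qquad z_i \sim \mathcal{N}_d(0, I_d), \\[4pt]
% Penalized (quasi-)likelihood: treat the z_i as parameters with a ridge penalty
  Q(\beta_0, \Lambda, Z) &= \sum_{i=1}^{n} \sum_{j=1}^{p} \log f\bigl(y_{ij} \mid \eta_{ij}\bigr)
  \;-\; \frac{1}{2} \sum_{i=1}^{n} z_i^\top z_i, \\[4pt]
% Fisher scoring step for one unit's factors, with the loadings held fixed;
% mu_i = g^{-1}(eta_i) and W_i = diag(Var(y_{ij} | z_i)) under the canonical link
  z_i &\leftarrow z_i + \bigl(\Lambda^\top W_i \Lambda + I_d\bigr)^{-1}
  \bigl(\Lambda^\top (y_i - \mu_i) - z_i\bigr).
\end{align*}
```

Under these assumptions, the update for each $z_i$ depends only on row $i$ of the data once $\Lambda$ and $\beta_0$ are fixed, and a symmetric update for each $(\beta_{0j}, \lambda_j)$ depends only on column $j$ once $Z$ is fixed. That separability is one plausible reading of why such a scheme parallelizes easily.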