Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in fields as diverse as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis, with a well-established theory and fast algorithms. Generalized Linear Latent Variable Models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then learning the model parameters with a Newton method and Fisher scoring. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than was previously possible. We apply our method to a dataset of 48,000 observational units, each with over 2,000 observed species, and find that most of the variability can be explained with a handful of factors. We provide an easy-to-use implementation of our proposed fitting algorithm.
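To make the abstract's description of the approach concrete, the following is a minimal, hypothetical sketch of a penalized quasi-likelihood style fit for a Poisson log-link latent factor model, alternating Fisher-scoring (Newton-type) updates for latent scores and loadings. The function name, the ridge-type penalty standing in for the PQL penalty terms, and all parameter choices are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def fit_poisson_gllvm(Y, n_factors=2, penalty=1.0, n_iter=50, seed=0):
    """Illustrative sketch (not the authors' code): fit Y_ij ~ Poisson(exp(u_i . v_j))
    by alternating penalized Fisher-scoring updates of latent scores U and loadings V."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.01 * rng.standard_normal((n, n_factors))
    V = 0.01 * rng.standard_normal((m, n_factors))
    ridge = penalty * np.eye(n_factors)

    def score_rows(A, B, Ymat):
        # One Fisher-scoring sweep over the rows of A, holding B fixed.
        for i in range(A.shape[0]):
            eta = B @ A[i]                                   # linear predictor for row i
            mu = np.exp(np.clip(eta, -30, 30))               # Poisson mean, clipped for stability
            grad = B.T @ (Ymat[i] - mu) - penalty * A[i]     # penalized score
            info = B.T @ (mu[:, None] * B) + ridge           # expected information + ridge penalty
            A[i] += np.linalg.solve(info, grad)              # Newton / Fisher-scoring step

    for _ in range(n_iter):
        score_rows(U, V, Y)      # update latent scores given loadings
        score_rows(V, U, Y.T)    # update loadings given latent scores
    return U, V

# Toy usage: simulate counts from a rank-2 model and recover a low-rank fit.
rng = np.random.default_rng(1)
U_true = 0.5 * rng.standard_normal((200, 2))
V_true = 0.5 * rng.standard_normal((50, 2))
Y = rng.poisson(np.exp(U_true @ V_true.T))
U_hat, V_hat = fit_poisson_gllvm(Y, n_factors=2)
```

Because each row update only involves a small `n_factors × n_factors` system, this alternating scheme scales linearly in the number of observational units and responses, which is the kind of behavior the abstract claims for large matrices.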