We develop a scalable class of models for latent variable estimation using composite Gaussian processes, with a focus on derivative Gaussian processes. We jointly model multiple data sources as outputs within a single probabilistic framework to improve the accuracy of latent variable inference. Exact Gaussian processes specified in this way scale poorly to large datasets. To overcome this, we extend the recently developed Hilbert space approximation methods for Gaussian processes to obtain a reduced-rank representation of the composite covariance function through its spectral decomposition. Specifically, we derive the spectral decomposition of derivative covariance functions and study its properties theoretically. Using these spectral decompositions, our methods scale readily to datasets with thousands of samples. We validate our methods in terms of latent variable estimation accuracy, uncertainty calibration, and inference speed across diverse simulation scenarios. Finally, using a real-world case study from single-cell biology, we demonstrate the potential of our models for estimating latent cellular ordering from gene expression levels, thus enhancing our understanding of the underlying biological process.
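To make the reduced-rank idea concrete, the following is a minimal sketch of a Hilbert space approximation for a one-dimensional Gaussian process and its derivative. It is an illustration under stated assumptions, not the construction derived in the paper: it assumes a squared exponential kernel, the standard Dirichlet Laplacian eigenfunctions on an interval [-L, L], and hypothetical names (`hs_basis`, `se_spectral_density`, `exact_blocks`) and parameter values chosen only for this example. The derivative blocks are obtained by differentiating the basis functions term by term and are compared against the closed-form derivative covariances.

```python
import numpy as np

def hs_basis(x, m, L):
    """Dirichlet Laplacian eigenfunctions on [-L, L] and their first derivatives."""
    j = np.arange(1, m + 1)
    sqrt_lam = j * np.pi / (2.0 * L)            # square roots of the eigenvalues
    arg = sqrt_lam * (x[:, None] + L)           # shape (n, m)
    phi = np.sin(arg) / np.sqrt(L)              # phi_j(x)
    dphi = sqrt_lam * np.cos(arg) / np.sqrt(L)  # d/dx phi_j(x)
    return phi, dphi, sqrt_lam

def se_spectral_density(omega, alpha, ell):
    """Spectral density of the 1-D squared exponential kernel (assumed kernel)."""
    return alpha**2 * np.sqrt(2.0 * np.pi) * ell * np.exp(-0.5 * (ell * omega) ** 2)

def exact_blocks(x, alpha, ell):
    """Closed-form covariance blocks for f and f' under the same kernel."""
    r = x[:, None] - x[None, :]
    k = alpha**2 * np.exp(-0.5 * (r / ell) ** 2)   # cov(f(x), f(x'))
    k_fd = k * r / ell**2                          # cov(f(x), f'(x'))
    k_dd = k * (1.0 / ell**2 - r**2 / ell**4)      # cov(f'(x), f'(x'))
    return k, k_fd, k_dd

# Illustrative settings (assumptions): inputs in [-1, 1], boundary L = 3, m = 40 basis functions.
x = np.linspace(-1.0, 1.0, 25)
alpha, ell, m, L = 1.0, 0.5, 40, 3.0

phi, dphi, sqrt_lam = hs_basis(x, m, L)
S = se_spectral_density(sqrt_lam, alpha, ell)      # one weight per basis function

# Reduced-rank covariance blocks: sums over m basis functions.
K_ff = (phi * S) @ phi.T
K_fd = (phi * S) @ dphi.T
K_dd = (dphi * S) @ dphi.T

K_ff_exact, K_fd_exact, K_dd_exact = exact_blocks(x, alpha, ell)
err_ff = np.max(np.abs(K_ff - K_ff_exact))
err_fd = np.max(np.abs(K_fd - K_fd_exact))
err_dd = np.max(np.abs(K_dd - K_dd_exact))
print(err_ff, err_fd, err_dd)
```

In this sketch every covariance block factors through an n-by-m feature matrix, so a model built on it can work with m basis coefficients instead of an n-by-n covariance matrix. How closely the approximation matches the exact kernel depends on the choice of `m` and `L` relative to the lengthscale and the input range.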