Gaussian processes are widely used as priors for unknown functions in statistics and machine learning. To achieve computationally feasible inference for large datasets, a popular approach is the Vecchia approximation, which is an ordered conditional approximation of the data vector that implies a sparse Cholesky factor of the precision matrix. The ordering and sparsity pattern are typically determined based on Euclidean distance of the inputs or locations corresponding to the data points. Here, we propose instead to use a correlation-based distance metric, which implicitly applies the Vecchia approximation in a suitable transformed input space. The correlation-based algorithm can be carried out in quasilinear time in the size of the dataset, and so it can be applied even for iterative inference on unknown parameters in the correlation structure. The correlation-based approach has two advantages for complex settings: It can result in more accurate approximations, and it offers a simple, automatic strategy that can be applied to any covariance, even when Euclidean distance is not applicable. We demonstrate these advantages in several settings, including anisotropic, nonstationary, multivariate, and spatio-temporal processes. We also illustrate our method on multivariate spatio-temporal temperature fields produced by a regional climate model.
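The idea of ordering and conditioning by correlation rather than Euclidean distance can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation. It assumes the distance d_ij = sqrt(1 - |rho_ij|) as one possible correlation-based metric and a greedy maximin ordering. It works on dense matrices, so it costs at least quadratic time, whereas the abstract describes a quasilinear algorithm. All function names and the anisotropic exponential kernel are hypothetical choices made for the example.

```python
import numpy as np

def anisotropic_exp_cov(X, scales):
    """Illustrative kernel (an assumption): exponential covariance on rescaled inputs."""
    Z = X / np.asarray(scales)
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-D)

def correlation_distance(K):
    """Assumed metric: d_ij = sqrt(1 - |rho_ij|), computed from a covariance matrix."""
    s = np.sqrt(np.diag(K))
    rho = K / np.outer(s, s)
    return np.sqrt(np.clip(1.0 - np.abs(rho), 0.0, None))

def maximin_order(D, first=0):
    """Greedy maximin ordering: pick the point farthest from all points chosen so far."""
    order = [first]
    mindist = D[first].copy()
    mindist[first] = -np.inf          # never re-select a chosen point
    for _ in range(D.shape[0] - 1):
        nxt = int(np.argmax(mindist))
        order.append(nxt)
        mindist = np.minimum(mindist, D[nxt])
        mindist[nxt] = -np.inf
    return np.array(order)

def conditioning_sets(D_ord, m):
    """For each ordered point i, the (up to) m nearest previously ordered points."""
    sets = [np.array([], dtype=int)]
    for i in range(1, D_ord.shape[0]):
        sets.append(np.argsort(D_ord[i, :i])[:m])
    return sets

def vecchia_loglik(y, K, sets):
    """Zero-mean Vecchia log-likelihood: sum of log p(y_i | y_c(i))."""
    ll = 0.0
    for i, c in enumerate(sets):
        if len(c) == 0:
            mu, var = 0.0, K[i, i]
        else:
            w = np.linalg.solve(K[np.ix_(c, c)], K[c, i])
            mu, var = w @ y[c], K[i, i] - w @ K[c, i]
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

def exact_loglik(y, K):
    """Dense zero-mean Gaussian log-likelihood, used only as a reference."""
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y)
    return -0.5 * (z @ z + 2 * np.log(np.diag(L)).sum() + len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
K = anisotropic_exp_cov(X, scales=[0.05, 1.0])   # strongly anisotropic toy example

order = maximin_order(correlation_distance(K))   # ordering in correlation distance
K_ord = K[np.ix_(order, order)]
y = np.linalg.cholesky(K_ord) @ rng.standard_normal(len(order))

sets = conditioning_sets(correlation_distance(K_ord), m=10)
approx = vecchia_loglik(y, K_ord, sets)
print("Vecchia (m=10):", approx, "exact:", exact_loglik(y, K_ord))
```

Because the distances are derived from the covariance itself, the same code applies unchanged to any valid covariance matrix, including cases where no Euclidean distance between inputs is available. When every previously ordered point is used as a conditioning set, the Vecchia log-likelihood coincides with the exact one, which gives a simple sanity check.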