Gaussian processes are ubiquitous as the primary tool for modeling spatial data. However, fitting a Gaussian process costs $\mathcal{O}(n^3)$ operations in the number of observations $n$, which makes direct parameter fitting algorithms infeasible at the scale of modern data collection initiatives. The Nearest Neighbor Gaussian Process (NNGP) was introduced as a scalable approximation to dense Gaussian processes and has been applied successfully to $n\sim 10^6$ observations. This project introduces the $\textit{clustered Nearest Neighbor Gaussian Process}$ (cNNGP), which reduces the computational and storage cost of the NNGP. Simulated data demonstrate the accuracy of parameter estimation and the reduction in computational and memory requirements: the cNNGP provided inference comparable to that obtained with the NNGP in a fraction of the sampling time. To showcase the method's performance, we modeled biomass over the state of Maine using data collected by the Global Ecosystem Dynamics Investigation (GEDI), generating wall-to-wall predictions over the state. In 16% of the time required by the NNGP, the cNNGP produced inference and biomass prediction maps nearly indistinguishable from those obtained with the NNGP.