Gaussian process (GP) regression in large-data contexts, which often arises in surrogate modeling of stochastic simulation experiments, is challenged by cubic runtimes. Coping with input-dependent noise in that setting is doubly challenging. Recent advances target reduced computational complexity through local approximation (e.g., LAGP) or otherwise induced sparsity. Yet these do not economically accommodate replication, a common design feature when attempting to separate signal from noise. Replication can offer both statistical and computational efficiencies, motivating several extensions to the local surrogate modeling toolkit. Introducing a nugget into a local kernel structure is just the first step. We argue that a new inducing point formulation (LIGP), already preferred over LAGP on the speed-vs-accuracy frontier, conveys additional advantages when replicates are involved. Woodbury identities allow local kernel structure to be expressed in terms of unique design locations only, increasing the amount of data (i.e., the neighborhood size) that may be leveraged without additional flops. We demonstrate that this upgraded LIGP provides more accurate prediction and uncertainty quantification than several modern alternatives. Illustrations are provided on benchmark data, real-world simulation experiments on epidemic management and ocean oxygen concentration, and in an options pricing control framework.
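The role of the Woodbury identities can be illustrated with a minimal sketch. It is not the LIGP implementation: it uses an ordinary GP with no inducing points, a fixed squared-exponential kernel, and a constant nugget `g`, all of which are assumptions made for illustration (as are the function names and the toy data). It checks numerically that predictions computed from all N replicated observations coincide with predictions computed from the n unique design locations, using replicate-averaged responses and a nugget scaled by the inverse replicate counts.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=0.5):
    """Squared-exponential kernel between rows of X1 and X2 (unit scale)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / lengthscale**2)

def predict_full(X_all, y_all, X_star, g):
    """GP prediction using every observation: an N x N linear solve."""
    K = sq_exp_kernel(X_all, X_all) + g * np.eye(len(X_all))
    k_star = sq_exp_kernel(X_star, X_all)
    mean = k_star @ np.linalg.solve(K, y_all)
    var = 1.0 + g - np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
    return mean, var

def predict_unique(X_uniq, y_bar, counts, X_star, g):
    """Same prediction from unique locations only: an n x n linear solve.

    The nugget at location i is divided by its replicate count a_i, and the
    responses are replaced by their per-location averages.
    """
    K = sq_exp_kernel(X_uniq, X_uniq) + g * np.diag(1.0 / counts)
    k_star = sq_exp_kernel(X_star, X_uniq)
    mean = k_star @ np.linalg.solve(K, y_bar)
    var = 1.0 + g - np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
    return mean, var

# Hypothetical toy design: n = 8 unique inputs with 1 to 5 replicates each.
rng = np.random.default_rng(0)
n, g = 8, 0.1
X_uniq = rng.uniform(size=(n, 2))
counts = rng.integers(1, 6, size=n)
idx = np.repeat(np.arange(n), counts)          # maps each observation to its location
X_all = X_uniq[idx]
y_all = np.sin(4 * X_all.sum(axis=1)) + np.sqrt(g) * rng.standard_normal(len(idx))
y_bar = np.bincount(idx, weights=y_all) / counts  # replicate averages

X_star = rng.uniform(size=(5, 2))
m_full, v_full = predict_full(X_all, y_all, X_star, g)
m_uniq, v_uniq = predict_unique(X_uniq, y_bar, counts, X_star, g)
print(np.allclose(m_full, m_uniq), np.allclose(v_full, v_uniq))
```

Because the unique-location solve costs O(n^3) rather than O(N^3), adding replicates enlarges the effective data set without enlarging the matrix that must be decomposed, which is the sense in which more data may be leveraged without additional flops.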