We present a method to fit exact Gaussian process models to large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed. Empirically, we show that our method can be directly plugged into well-known inference schemes to fit exact Gaussian process models to large datasets.