We present a method to approximate Gaussian process regression models for large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed.
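To make the idea concrete, here is a minimal sketch of adaptive early stopping inside a blocked Cholesky factorization. It is not the paper's bound-based stopping criterion: instead of probabilistic bounds on the full evidence, it simply monitors the per-datum increment of the subset log-marginal likelihood across blocks and stops once that increment stabilises (the linear trend noted above). The kernel, block size, and tolerance are illustrative choices, and the function names are hypothetical.

```python
# Sketch only: assumes the stopping rule "per-datum log-marginal-likelihood
# increment has stabilised", not the paper's probabilistic evidence bounds.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; any positive-definite kernel would do."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def adaptive_cholesky_gp(X, y, noise=1e-2, block=64, tol=1e-3):
    """Blocked Cholesky of K + noise*I that stops once the per-datum increment
    of the subset log-marginal likelihood stabilises. Returns the partial
    factor L[:m, :m], the whitened targets z = L^{-1} y[:m], and the subset size m."""
    n = len(y)
    K = rbf_kernel(X, X) + noise * np.eye(n)
    L = np.zeros((n, n))
    z = np.zeros(n)      # forward-substituted targets, filled block by block
    logdet = 0.0         # running sum of log diag(L); twice this is log|K_m|
    prev_rate, prev_lml = None, None
    m = 0
    while m < n:
        b = min(block, n - m)
        # Standard blocked Cholesky update for rows m .. m+b-1.
        if m > 0:
            L[m:m+b, :m] = solve_triangular(L[:m, :m], K[:m, m:m+b], lower=True).T
        S = K[m:m+b, m:m+b] - L[m:m+b, :m] @ L[m:m+b, :m].T
        L[m:m+b, m:m+b] = cholesky(S, lower=True)
        # Incrementally update the whitened targets and the log-determinant.
        z[m:m+b] = solve_triangular(L[m:m+b, m:m+b],
                                    y[m:m+b] - L[m:m+b, :m] @ z[:m], lower=True)
        logdet += np.log(np.diag(L[m:m+b, m:m+b])).sum()
        m += b
        # Exact log-marginal likelihood of the first m points.
        lml = -0.5 * z[:m] @ z[:m] - logdet - 0.5 * m * np.log(2 * np.pi)
        if prev_lml is not None:
            rate = (lml - prev_lml) / b   # per-datum change over the last block
            if prev_rate is not None and abs(rate - prev_rate) < tol * max(abs(rate), 1.0):
                break                      # trend is roughly linear: stop early
            prev_rate = rate
        prev_lml = lml
    return L[:m, :m], z[:m], m
```

As the abstract notes, all quantities needed for the stopping check (the diagonal of L and the forward-substituted targets) are produced anyway by the factorization, so the overhead of monitoring is small; the returned partial factor and whitened targets are exactly what a subset-of-data GP posterior needs.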