Handling large datasets has long been a major bottleneck for traditional statistical models. Consequently, when accurate point prediction is the primary goal, machine learning models are often preferred over their statistical counterparts for larger problems. Fully probabilistic statistical models, however, often outperform other models at quantifying the uncertainty associated with model predictions. We develop a data-driven statistical modeling framework that combines the uncertainties from an ensemble of statistical models, each learned on a smaller subset of the data carefully chosen to account for imbalances in the input space. We demonstrate this method on a photometric redshift estimation problem in cosmology, which seeks to infer a distribution of the redshift (the stretching effect in observing far-away galaxies) given multivariate color information observed for an object in the sky. Our proposed method performs balanced partitioning, graph-based data subsampling across the partitions, and training of an ensemble of Gaussian process models.
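The abstract does not say how the ensemble's uncertainties are combined, so the following is only an illustrative sketch under one common assumption: each ensemble member returns a Gaussian predictive mean and variance for the same object, and the members are pooled as an equal-weight mixture. The function name `combine_gaussian_predictions` and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def combine_gaussian_predictions(
    means: Sequence[float],
    variances: Sequence[float],
) -> Tuple[float, float]:
    """Collapse K Gaussian predictions into a single mean and variance.

    Assumption (not from the source): the ensemble is treated as an
    equal-weight mixture of the K predictive distributions.
    """
    if len(means) != len(variances) or len(means) == 0:
        raise ValueError("need one variance per mean, and at least one model")
    k = len(means)
    mix_mean = sum(means) / k
    # Law of total variance: average within-model variance plus the
    # spread of the model means around the mixture mean.
    within = sum(variances) / k
    between = sum((m - mix_mean) ** 2 for m in means) / k
    return mix_mean, within + between


# Hypothetical redshift predictions from three ensemble members.
mean, var = combine_gaussian_predictions([0.50, 0.60, 0.70], [0.01, 0.02, 0.03])
```

The pooled variance is never smaller than the average member variance: disagreement between members adds to the reported uncertainty, which is the behavior one would want when models trained on different subsets disagree about an object.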