Harnessing distributed computing environments to build scalable inference algorithms for very large data sets is a core challenge across the broad mathematical sciences. Here we provide a theoretical framework to do so along with fully implemented examples of scalable algorithms with performance guarantees. We begin by formalizing the class of statistics which admit straightforward calculation in such environments through independent parallelization. We then show how to use such statistics to approximate arbitrary functional operators, thereby providing practitioners with a generic approximate inference procedure that does not require data to reside entirely in memory. We characterize the $L^2$ approximation properties of our approach, and then use it to treat two canonical examples that arise in large-scale statistical analyses: sample quantile calculation and local polynomial regression. A variety of avenues and extensions remain open for future work.
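As a minimal illustration of the idea, and not the authors' actual algorithm, the sketch below approximates a sample quantile using only statistics that can be computed independently on each data chunk and then summed. It assumes a fixed grid of bin edges covering the range of the data. The names `chunk_counts`, `approx_quantile`, and `edges` are hypothetical and chosen here for illustration. Each chunk contributes only a vector of bin counts, so the full data set never needs to reside in memory at once.

```python
import numpy as np


def chunk_counts(chunk, edges):
    """Per-chunk summary: number of observations in each bin of a fixed grid.

    Assumption: `edges` spans the data; values outside the grid are ignored
    by np.histogram.
    """
    counts, _ = np.histogram(chunk, bins=edges)
    return counts


def approx_quantile(chunks, edges, p):
    """Approximate the p-th sample quantile from summed per-chunk bin counts."""
    total = np.zeros(len(edges) - 1, dtype=np.int64)
    for chunk in chunks:  # each iteration could run on a separate worker
        total += chunk_counts(chunk, edges)
    # Empirical CDF evaluated at the grid points (0 at the first edge).
    cdf = np.concatenate(([0.0], np.cumsum(total) / total.sum()))
    # Invert the piecewise-linear CDF approximation at level p.
    return float(np.interp(p, cdf, edges))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=100_000)
    chunks = np.array_split(data, 8)        # stand-in for distributed partitions
    edges = np.linspace(-6.0, 6.0, 1201)    # assumed grid, bin width 0.01
    print(approx_quantile(chunks, edges, 0.5), np.median(data))
```

Because the bin counts are additive, the result does not depend on how the data are partitioned. In this sketch the approximation error is governed by the bin width, so a finer grid trades memory for accuracy.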