We introduce the notion of a reproducible algorithm in the context of learning. A reproducible learning algorithm is resilient to variations in its samples -- with high probability, it returns the exact same output when run on two samples drawn from the same underlying distribution. We begin by unpacking the definition, clarifying how randomness is instrumental in balancing accuracy and reproducibility. We initiate a theory of reproducible algorithms, showing how reproducibility implies desirable properties such as data reuse and efficient testability. Despite the exceedingly strong demands of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. First, we show that any statistical query (SQ) algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces, via a reproducible weak learner and a reproducible boosting algorithm. Finally, we initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight upper and lower bounds on the sample complexity of reproducible versus nonreproducible SQ algorithms.
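Concretely, the reproducibility requirement described above can be formalized as follows; this is a hedged rendering, with the failure parameter \rho, the sample size n, and the notation A(s; r) for running A on sample s with internal randomness r chosen here for illustration:

    \Pr_{s_1, s_2 \sim D^n,\; r}\bigl[\, A(s_1; r) = A(s_2; r) \,\bigr] \;\ge\; 1 - \rho

Here s_1 and s_2 are independent size-n samples from the distribution D, and r is a single random string shared by both runs; the probability is taken over both samples and r. Fixing r across the two runs is what makes such a strong guarantee achievable for randomized algorithms, reflecting the role of randomness in balancing accuracy and reproducibility noted above.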
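As a rough illustration of how shared randomness is typically put to work, the following Python sketch answers a single statistical query reproducibly by rounding an empirical estimate to a randomly shifted grid; the function name, parameters, and rounding scheme are illustrative assumptions, not code from the paper:

import random
import statistics

def reproducible_sq(phi, samples, alpha, shared_seed):
    # Hedged sketch: answer the statistical query E[phi] reproducibly via
    # randomized rounding. All names and parameters are illustrative.
    #   phi:         the query, mapping one example to a value in [0, 1]
    #   samples:     i.i.d. draws from the underlying distribution D
    #   alpha:       grid width; larger alpha favors reproducibility over accuracy
    #   shared_seed: randomness shared across runs (same seed => same grid)

    # Empirical estimate of E[phi]; concentrates near the true mean once
    # len(samples) is large enough.
    estimate = statistics.fmean(phi(x) for x in samples)

    # Grid offset drawn from the *shared* randomness, independent of the sample.
    offset = random.Random(shared_seed).uniform(0.0, alpha)

    # Snap the estimate to the nearest point of the shifted grid
    # {offset + k * alpha : k an integer}; nearby estimates from two runs
    # usually land on the same grid point.
    k = round((estimate - offset) / alpha)
    return offset + k * alpha

Two runs on fresh samples but the same shared_seed produce empirical estimates that concentrate within some tolerance tau of the true mean; they return different answers only when a grid boundary happens to fall between the two estimates, which occurs with probability roughly 2*tau/alpha over the random offset. Driving that failure probability down forces a coarser grid (larger alpha), and hence a larger sample to retain the same accuracy -- one concrete face of the modest sample-complexity overhead the abstract mentions.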