Algorithms often have tunable parameters that impact performance metrics such as runtime and solution quality. For many algorithms used in practice, no parameter settings admit meaningful worst-case bounds, so the parameters are made available for the user to tune. Alternatively, parameters may be tuned implicitly within the proof of a worst-case approximation ratio or runtime bound. Worst-case instances, however, may be rare or nonexistent in practice. A growing body of research has demonstrated that data-driven algorithm design can lead to significant improvements in performance. This approach uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set. We provide a broadly applicable theory for deriving generalization guarantees that bound the difference between the algorithm's average performance over the training set and its expected performance on the underlying distribution. Our results apply no matter how the parameters are tuned, be it via an automated or a manual approach. The challenge is that for many types of algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause large changes in behavior. Prior research has proved generalization bounds via case-by-case analyses of greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms. We uncover a unifying structure that we use to prove extremely general guarantees, while still recovering the bounds from prior research. Our guarantees apply whenever an algorithm's performance is a piecewise-constant, piecewise-linear, or, more generally, piecewise-structured function of its parameters. Our theory also implies novel bounds for voting mechanisms and dynamic programming algorithms from computational biology.
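To make the piecewise-structured setting concrete, the following is a minimal illustrative sketch (not a construction from the paper): a greedy knapsack heuristic whose scoring rule `v / s**rho` is governed by a single parameter `rho`. For any fixed instance, the item ordering, and hence the solution value, changes only when `rho` crosses one of finitely many thresholds, so performance is a piecewise-constant function of `rho`; data-driven tuning then amounts to picking the `rho` with the best average value over a training set of instances.

```python
import random

def greedy_knapsack(values, sizes, capacity, rho):
    """Greedily pack items in decreasing order of value / size**rho.

    The parameter rho interpolates between ignoring sizes (rho = 0)
    and the classic value-density rule (rho = 1).
    """
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / sizes[i] ** rho,
                   reverse=True)
    total_value, used = 0.0, 0.0
    for i in order:
        if used + sizes[i] <= capacity:
            used += sizes[i]
            total_value += values[i]
    return total_value

def tune_rho(training_set, candidate_rhos):
    """Return the rho with the best average value over the training set."""
    def avg_value(rho):
        return sum(greedy_knapsack(v, s, c, rho)
                   for v, s, c in training_set) / len(training_set)
    return max(candidate_rhos, key=avg_value)

# Sample a training set of random instances from a (here, synthetic)
# application-specific distribution, then tune rho by grid search.
random.seed(0)
instances = [([random.uniform(1, 10) for _ in range(20)],   # values
              [random.uniform(1, 10) for _ in range(20)],   # sizes
              25.0)                                         # capacity
             for _ in range(50)]
best_rho = tune_rho(instances, [k / 10 for k in range(31)])
```

The paper's generalization guarantees bound how far the tuned parameter's average training performance can deviate from its expected performance on fresh instances from the same distribution; the grid search above is just one simple tuning procedure, and the guarantees hold regardless of which procedure is used.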