The world of empirical machine learning (ML) relies strongly on benchmarks to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of "a benchmark lottery" that describes the overall fragility of the ML benchmarking process. The benchmark lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a method being perceived as superior. On multiple benchmark setups that are prevalent in the ML community, we show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks, highlighting the fragility of current paradigms and the potentially fallacious interpretations derived from benchmarking ML methods. Given that every benchmark makes a statement about what it perceives to be important, we argue that this can lead to biased progress in the community. We discuss the implications of the observed phenomena and provide recommendations for mitigating them, using multiple machine learning domains and communities as use cases, including natural language processing, computer vision, information retrieval, recommender systems, and reinforcement learning.
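To make the claim about task selection concrete, the following minimal Python sketch (not taken from the paper; all method names, task names, and scores are hypothetical) shows how the same two methods can swap ranks depending purely on which subset of tasks a benchmark suite aggregates over.

```python
# A minimal sketch of the "benchmark lottery" effect: the ranking of two
# methods flips depending on which benchmark tasks are chosen.
# All scores below are hypothetical, for illustration only.

scores = {
    "method_A": {"task1": 0.91, "task2": 0.72, "task3": 0.85, "task4": 0.60, "task5": 0.88},
    "method_B": {"task1": 0.89, "task2": 0.80, "task3": 0.83, "task4": 0.75, "task5": 0.70},
}

def mean_score(method, tasks):
    """Aggregate a method's performance as the mean over a chosen task subset."""
    return sum(scores[method][t] for t in tasks) / len(tasks)

# Two equally plausible benchmark suites drawn from the same task pool.
suite_1 = ["task1", "task3", "task5"]
suite_2 = ["task2", "task3", "task4"]

for suite in (suite_1, suite_2):
    ranking = sorted(scores, key=lambda m: mean_score(m, suite), reverse=True)
    print(suite, "->", ranking)

# Output:
# ['task1', 'task3', 'task5'] -> ['method_A', 'method_B']   (A: 0.880, B: 0.807)
# ['task2', 'task3', 'task4'] -> ['method_B', 'method_A']   (A: 0.723, B: 0.793)
```

Neither suite is "wrong"; each simply encodes a different judgment about which tasks matter, which is precisely the statement every benchmark implicitly makes.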