Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. While the static setting has been studied extensively, both theoretically and empirically, its dynamic counterpart lags behind: empirical studies remain limited and no theoretical foundation exists to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
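The sequential protocol of the first model can be illustrated with a toy sketch. Everything below (the 1-D threshold task, the boundary value, the sample sizes) is our own illustrative assumption, not the paper's construction: each round, annotators keep only examples the current model misclassifies, the new examples are added to the pool, and the model is refit.

```python
import random

random.seed(0)

# Hypothetical ground-truth decision boundary for a toy 1-D task.
TRUE_THRESHOLD = 0.3

def label(x):
    """Noiseless ground-truth label; the paper also analyzes label noise."""
    return 1 if x >= TRUE_THRESHOLD else 0

def predict(threshold, x):
    return 1 if x >= threshold else 0

def fit_threshold(data):
    """Empirical-risk-minimizing threshold via exhaustive search over samples."""
    best_t, best_err = None, float("inf")
    for t, _ in data:
        err = sum(predict(t, x) != y for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

test_set = [(x, label(x)) for x in (random.uniform(-1, 1) for _ in range(1000))]

def accuracy(t):
    return sum(predict(t, x) == y for x, y in test_set) / len(test_set)

# Round 0: an ordinary static benchmark. Each later round alternates
# adversarial data collection (keep only points the current model gets
# wrong) with refitting, as in the sequential dynamic-benchmark model.
data = [(x, label(x)) for x in (random.uniform(-1, 1) for _ in range(20))]
model = fit_threshold(data)
accs = [accuracy(model)]
for _ in range(5):
    pool = (random.uniform(-1, 1) for _ in range(200))
    hard = [(x, label(x)) for x in pool if predict(model, x) != label(x)]
    data += hard
    model = fit_threshold(data)
    accs.append(accuracy(model))

print([round(a, 3) for a in accs])  # test accuracy per round
```

In this realizable, noiseless toy the loop keeps narrowing the error region around the boundary, so the sketch only illustrates the mechanics of the alternation; the stalling and label-noise results summarized above concern richer hypothesis classes and noisy annotation, where progress is not guaranteed.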