Test-time augmentation -- the aggregation of predictions across transformed versions of a test input -- is a common practice in image classification. Traditionally, predictions are combined using a simple average. In this paper, we present 1) experimental analyses that shed light on cases in which the simple average is suboptimal and 2) a method to address these shortcomings. A key finding is that even when test-time augmentation produces a net improvement in accuracy, it can change many correct predictions into incorrect predictions. We delve into when and why test-time augmentation changes a prediction from being correct to incorrect and vice versa. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.
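The contrast between simple averaging and a learned aggregation can be illustrated with a minimal sketch. The abstract does not specify the form of the learning-based method, so the per-augmentation weights below are an assumption made for illustration only. The function names, probability values, and weights are all hypothetical. In practice such weights would be fit on held-out data rather than set by hand.

```python
import numpy as np

def average_tta(probs):
    """Simple average of class probabilities over augmentations.

    probs: array of shape (n_augmentations, n_classes).
    """
    return probs.mean(axis=0)

def weighted_tta(probs, weights):
    """Weighted average with one weight per augmentation.

    The weights are an illustrative assumption; they are normalized
    to sum to 1 so the result is still a probability vector.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ probs

# Hypothetical model outputs for one test image with true class 0.
# Row 0: original image; rows 1-2: two augmented versions.
probs = np.array([
    [0.6, 0.4],
    [0.3, 0.7],
    [0.4, 0.6],
])

avg = average_tta(probs)
weighted = weighted_tta(probs, [0.8, 0.1, 0.1])  # hand-picked, not learned

original_pred = int(np.argmax(probs[0]))
avg_pred = int(np.argmax(avg))
weighted_pred = int(np.argmax(weighted))
```

In this toy case the original prediction is correct, the simple average flips it to the wrong class, and a weighting that trusts the original image more keeps it correct. This mirrors the kind of correct-to-incorrect change the abstract describes, without claiming to reproduce the paper's method.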