Test-time augmentation (TTA), the aggregation of predictions across transformed versions of a test input, is a common practice in image classification. In this paper, we present theoretical and experimental analyses that shed light on (1) when test-time augmentation is likely to be helpful and (2) when to use various test-time augmentation policies. A key finding is that even when TTA produces a net improvement in accuracy, it can change many correct predictions into incorrect ones. We examine when and why test-time augmentation changes a prediction from correct to incorrect and vice versa. Our analysis suggests that the nature and amount of training data, the model architecture, and the augmentation policy all matter. Building on these insights, we present a learning-based method for aggregating test-time augmentations. Experiments across a diverse set of models, datasets, and augmentations show that our method delivers consistent improvements over existing approaches.
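To make the key finding concrete, the following is a minimal sketch of the standard mean-averaging form of TTA (not the learning-based aggregation method proposed in the paper), together with a count of predictions that TTA corrects and predictions that it corrupts. The names `tta_predict`, `count_flips`, `toy_model`, and `demo`, as well as the two-pixel "images" and their labels, are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]
Probs = List[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def tta_predict(model: Callable[[Image], Probs],
                x: Image,
                augmentations: Sequence[Callable[[Image], Image]]) -> Probs:
    """Average the model's class probabilities over all augmented views of x."""
    outputs = [model(aug(x)) for aug in augmentations]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


def count_flips(base_preds: Sequence[int],
                tta_preds: Sequence[int],
                labels: Sequence[int]) -> Tuple[int, int]:
    """Return (corrections, corruptions) caused by switching to TTA."""
    triples = list(zip(base_preds, tta_preds, labels))
    corrections = sum(b != y and t == y for b, t, y in triples)
    corruptions = sum(b == y and t != y for b, t, y in triples)
    return corrections, corruptions


def toy_model(x: Image) -> Probs:
    """Hypothetical two-class model that looks only at the first pixel."""
    return [x[0], 1.0 - x[0]]


def demo() -> Tuple[float, float, int, int]:
    """Run the toy example; returns (base_acc, tta_acc, corrections, corruptions)."""
    # Assumed augmentation policy: identity plus horizontal flip.
    augmentations = [lambda x: list(x), lambda x: list(reversed(x))]
    # Made-up inputs; every example is labeled class 0.
    images = [[0.4, 0.9], [0.6, 0.1], [0.8, 0.7], [0.3, 0.8]]
    labels = [0, 0, 0, 0]

    base_preds = [argmax(toy_model(x)) for x in images]
    tta_preds = [argmax(tta_predict(toy_model, x, augmentations)) for x in images]

    base_acc = sum(p == y for p, y in zip(base_preds, labels)) / len(labels)
    tta_acc = sum(p == y for p, y in zip(tta_preds, labels)) / len(labels)
    corrections, corruptions = count_flips(base_preds, tta_preds, labels)
    return base_acc, tta_acc, corrections, corruptions
```

On this toy data, TTA raises accuracy from 0.5 to 0.75 while still turning one originally correct prediction into an incorrect one, which is the kind of net-positive-but-not-uniformly-positive behavior the abstract describes.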