There is evidence that transformers offer state-of-the-art recognition performance on tasks involving overhead imagery (e.g., satellite imagery). However, it is difficult to make unbiased empirical comparisons between competing deep learning models, making it unclear whether, and to what extent, transformer-based models are beneficial. In this paper we systematically compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery. Each model is given a similar budget of free parameters, and their hyperparameters are optimized using Bayesian optimization with a fixed quantity of data and computation time. We conduct our experiments with a large and diverse dataset comprising two large public benchmarks: Inria and DeepGlobe. We perform additional ablation studies to explore the impact of specific transformer-based modeling choices. Our results suggest that transformers provide consistent, but modest, performance improvements. However, we only observe this advantage in hybrid models that combine convolutional and transformer-based structures; fully transformer-based models achieve relatively poor performance.