This paper presents Contrastive Transformer, a contrastive learning scheme that uses the patches innate to vision transformers. Contrastive Transformer enables existing contrastive learning techniques, commonly used for image classification, to benefit dense downstream prediction tasks such as semantic segmentation. The scheme performs supervised patch-level contrastive learning, selecting patches based on the ground-truth mask and subsequently using them for hard-negative and hard-positive sampling. The scheme applies to all vision-transformer architectures, is easy to implement, and introduces a minimal additional memory footprint. It also removes the need for very large batch sizes, since each patch is treated as an image. We apply and test Contrastive Transformer on aerial image segmentation, a task known for low-resolution data, large class imbalance, and semantically similar classes. We perform extensive experiments to show the efficacy of the Contrastive Transformer scheme on the ISPRS Potsdam aerial image segmentation dataset. Additionally, we show the generalizability of our scheme by applying it to multiple inherently different transformer architectures. Ultimately, the results show a consistent increase in mean IoU across all classes.
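To make the idea concrete, the following is a minimal PyTorch sketch of a patch-level supervised contrastive objective of the kind described above. It is an illustrative assumption, not the paper's implementation: the function names (`patch_labels_from_mask`, `patch_supcon_loss`), the majority-vote label assignment, and the temperature value are all hypothetical, and the hard-positive/hard-negative mining step is omitted for brevity.

```python
# Hypothetical sketch of patch-level supervised contrastive learning.
# Function names and the majority-vote labeling are illustrative assumptions,
# not the paper's actual API; hard positive/negative mining is omitted.
import torch
import torch.nn.functional as F


def patch_labels_from_mask(mask: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Assign each patch the majority class of the mask pixels it covers.

    mask: (B, H, W) integer class mask.
    Returns: (B, N) patch labels, N = (H // patch_size) * (W // patch_size).
    """
    B, H, W = mask.shape
    patches = mask.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.reshape(B, -1, patch_size * patch_size)  # (B, N, P*P)
    return patches.mode(dim=-1).values  # majority vote per patch


def patch_supcon_loss(embeddings: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over patch embeddings.

    embeddings: (B, N, D) patch tokens from a vision transformer.
    labels:     (B, N) patch-level class labels.
    Every patch is treated as an independent sample, which is why
    large image-level batches are unnecessary.
    """
    z = F.normalize(embeddings.reshape(-1, embeddings.size(-1)), dim=-1)  # (B*N, D)
    y = labels.reshape(-1)                                                # (B*N,)

    sim = z @ z.t() / temperature                         # pairwise similarities
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask

    # Log-softmax over all other patches (positives and negatives).
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-likelihood of positives per anchor; anchors whose class
    # appears nowhere else in the batch are skipped.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()
```

In this sketch the contrastive loss would be added to the usual segmentation loss during training; the hard-example selection described in the abstract would plug in where `pos_mask` is built, restricting it to the lowest-similarity positives and the highest-similarity negatives per anchor.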