Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis, pushing the state of the art in classification, detection, and segmentation tasks. In recent years, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding impressive performance in the natural image domain while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore the benefits and drawbacks of transformer-based models for medical image classification. We conduct a series of experiments on several standard 2D medical image benchmark datasets and tasks. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers can perform on par with CNNs when pretrained on ImageNet, in both supervised and self-supervised settings, rendering them a viable alternative to CNNs.