Convolutional Neural Networks (CNNs) have reigned for a decade as the de facto approach to automated medical image diagnosis. Recently, vision transformers (ViTs) have emerged as a competitive alternative to CNNs, yielding similar levels of performance while possessing several interesting properties that could prove beneficial for medical imaging tasks. In this work, we explore whether it is time to move to transformer-based models or if we should keep working with CNNs. Can we trivially switch to transformers? If so, what are the advantages and drawbacks of switching to ViTs for medical image diagnosis? We consider these questions in a series of experiments on three mainstream medical image datasets. Our findings show that, while CNNs perform better when trained from scratch, off-the-shelf vision transformers using default hyperparameters are on par with CNNs when pretrained on ImageNet, and outperform their CNN counterparts when pretrained using self-supervision.