Radiographs are a versatile diagnostic tool for the detection and assessment of pathologies, for treatment planning, and for navigation and localization purposes in clinical interventions. However, their interpretation and assessment by radiologists can be tedious and error-prone. Thus, a wide variety of deep learning methods have been proposed to support radiologists in interpreting radiographs. Mostly, these approaches rely on convolutional neural networks (CNNs) to extract features from images. Especially for the multi-label classification of pathologies on chest radiographs (chest X-rays, CXR), CNNs have proven to be well suited. On the contrary, Vision Transformers (ViTs) have not been applied to this task despite their high classification performance on generic images and their interpretable local saliency maps, which could add value to clinical interventions. ViTs do not rely on convolutions but on patch-based self-attention, and in contrast to CNNs, no prior knowledge of local connectivity is present. While this leads to increased capacity, ViTs typically require an excessive amount of training data, which represents a hurdle in the medical domain, as high costs are associated with collecting large medical data sets. In this work, we systematically compare the classification performance of ViTs and CNNs for different data set sizes and evaluate more data-efficient ViT variants (DeiT). Our results show that while the performance of ViTs and CNNs is on par, with a small benefit for ViTs, DeiTs outperform both if a reasonably large data set is available for training.
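To make the contrast with convolutions concrete, the following is a minimal NumPy sketch of the patch-based self-attention described above: an image is split into non-overlapping patches, and each patch token attends to every other token without any built-in notion of local connectivity. This is an illustrative toy, not the paper's implementation; the image size, patch size, and randomly initialized projection matrices are assumptions, and ViT components such as positional embeddings, multiple heads, and DeiT's distillation token are omitted.

```python
import numpy as np

def patchify(img, patch):
    """Split an H x W image into flattened non-overlapping patches (ViT-style tokens)."""
    H, W = img.shape
    patches = [img[i:i + patch, j:j + patch].ravel()
               for i in range(0, H, patch)
               for j in range(0, W, patch)]
    return np.stack(patches)  # shape: (num_patches, patch * patch)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention: every patch token attends to all tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over tokens
    return w @ v

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))   # toy stand-in for a radiograph (assumed size)
tokens = patchify(img, 4)           # 4 patches of 16 pixels each
d = tokens.shape[1]
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (4, 16): one updated representation per patch token
```

Note that the attention weights couple every patch with every other patch from the first layer on; a CNN, by contrast, mixes information only within its local receptive field, which is the inductive bias the abstract refers to.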