Recently, vision transformers were shown to outperform convolutional neural networks when pretrained on sufficient amounts of data. Compared to convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow more flexible feature detection. Motivated by these promising feature-detection capabilities, this work explores vision transformers for tumor detection in digital pathology whole-slide images across four tissue types, as well as for tissue-type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to that of the state-of-the-art convolutional neural network ResNet18. Because annotated whole-slide images are scarce, we further compared both models after pretraining on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for tumor detection in three of the four tissue types, while the ResNet18 performed slightly better on the remaining tasks. The aggregated slide-level predictions of both models were correlated, indicating that the models captured similar imaging features. Altogether, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. To surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.
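As a concrete illustration of the patch-wise comparison described above, the sketch below sets up both backbones for binary tumor classification on whole-slide-image patches. This is a minimal sketch, not the authors' pipeline: the use of the timm library, the model identifiers, the binary tumor/non-tumor head, and the mean-probability slide-level aggregation are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): patch-wise tumor
# classification with DeiT-Tiny vs. ResNet18, using the timm model zoo.
import torch
import timm

# Two-class heads (tumor vs. non-tumor). pretrained=False keeps the sketch
# offline-runnable; the paper instead pretrains on unlabeled whole-slide images
# with self-supervised methods before fine-tuning.
deit = timm.create_model("deit_tiny_patch16_224", pretrained=False, num_classes=2)
resnet = timm.create_model("resnet18", pretrained=False, num_classes=2)

# A hypothetical batch of 8 RGB patches extracted from one whole-slide image.
patches = torch.randn(8, 3, 224, 224)

for name, model in [("DeiT-Tiny", deit), ("ResNet18", resnet)]:
    model.eval()
    with torch.no_grad():
        probs = model(patches).softmax(dim=-1)[:, 1]  # per-patch tumor probability
    # One simple slide-level aggregation: average the patch probabilities.
    # The paper aggregates patch predictions to slide level; the exact
    # aggregation rule is not specified in the abstract.
    print(f"{name}: slide-level tumor score = {probs.mean().item():.3f}")
```

Comparing the two slide-level scores across many slides, as the paper does, would reveal the correlation between the models' aggregated predictions noted above.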