Deep learning can serve as an assistive technology to help doctors identify COVID-19 infections quickly and accurately. Recently, the Vision Transformer (ViT) has shown great potential for image classification due to its global receptive field. However, lacking the inductive biases inherent to CNNs, ViT-based architectures suffer from limited feature richness and are difficult to train. In this paper, we propose a new structure, Transformer for COVID-19 (COVT), to improve the performance of ViT-based architectures on small COVID-19 datasets. It uses a CNN as a feature extractor to effectively capture local structural information, and introduces average pooling into ViT's Multilayer Perceptron (MLP) module to capture global information. Experiments on two COVID-19 datasets and the ImageNet dataset demonstrate the effectiveness of our method.
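The core architectural change can be illustrated with a minimal NumPy sketch of an MLP block augmented with average pooling over tokens. This is a hypothetical reconstruction of the idea described above, not the authors' implementation; the function name `covt_mlp`, the use of ReLU (Transformers typically use GELU), and the additive injection of the pooled global vector are all illustrative assumptions.

```python
import numpy as np

def covt_mlp(x, W1, b1, W2, b2):
    """Sketch of a ViT MLP block with average pooling for global context.

    Hypothetical reconstruction of the COVT idea, not the authors' code.
    x: (num_tokens, dim) token embeddings from the preceding attention block.
    """
    h = np.maximum(x @ W1 + b1, 0.0)      # standard MLP expansion + ReLU (real ViTs use GELU)
    g = h.mean(axis=0, keepdims=True)     # average pool across tokens -> one global vector
    h = h + g                             # broadcast global information back to every token
    return h @ W2 + b2                    # project back to the embedding dimension

# Toy usage: 5 tokens with embedding dimension 8, hidden dimension 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
W1, b1 = rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 8)), np.zeros(8)
out = covt_mlp(x, W1, b1, W2, b2)         # shape (5, 8), same as the input tokens
```

Because the pooled vector `g` averages over all tokens, every token's output depends on the whole image, which is one way to give the MLP module the global information the abstract refers to.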