The quality and richness of the feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly affect the robustness of model performance. In medical computer vision, such information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, allowing various CNN architectures with diverse output features and pre-training strategies to serve as the backbone. We propose effective feature projection methods to reduce redundancy among CNN-generated features and to control the input size of the ViTs. Extensive experiments with various Scopeformer models show that model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversifying the pre-training paradigms for the CNNs, using different pre-training datasets, and applying style transfer techniques, we demonstrate an overall improvement in model performance at various computational budgets. We then propose smaller, compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer further reduced the parameter count by almost 17 times with a negligible reduction in performance. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models.
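The abstract reports a weighted logarithmic loss but does not define it. As a rough illustration only, the sketch below computes one common form of this metric for multi-label classification: the per-label binary log loss, averaged with label weights and then averaged over samples. The function name `weighted_log_loss`, the clipping constant `eps`, and the example weights are assumptions made for illustration; they are not taken from the Scopeformer work.

```python
import math


def weighted_log_loss(y_true, y_prob, weights, eps=1e-15):
    """Mean over samples of the weight-normalized per-label binary log loss.

    y_true:  list of samples, each a list of 0/1 labels
    y_prob:  list of samples, each a list of predicted probabilities
    weights: one non-negative weight per label (hypothetical values)
    """
    total_weight = sum(weights)
    sample_losses = []
    for labels, probs in zip(y_true, y_prob):
        loss = 0.0
        for y, p, w in zip(labels, probs, weights):
            # Clip probabilities so that log(0) is never evaluated.
            p = min(max(p, eps), 1.0 - eps)
            loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        sample_losses.append(loss / total_weight)
    return sum(sample_losses) / len(sample_losses)


# Hypothetical example: two labels, the first weighted twice as heavily.
example_weights = [2.0, 1.0]
uninformed = weighted_log_loss([[1, 0]], [[0.5, 0.5]], example_weights)
confident = weighted_log_loss([[1, 0]], [[0.9, 0.1]], example_weights)
```

Under this definition, a model that always predicts 0.5 scores ln 2 regardless of the weights, so lower values indicate predictions that are both more confident and more often correct.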