The human gastrointestinal (GI) tract can exhibit a wide variety of abnormal mucosal findings, ranging from mild irritation to life-threatening disease. Prompt identification of gastrointestinal disorders helps arrest disease progression and improve therapeutic outcomes. This paper presents an ensemble of pre-trained vision transformers (ViTs) for accurately classifying endoscopic images of the GI tract in order to categorize gastrointestinal conditions and diseases. ViTs are attention-based neural networks that have transformed image recognition by applying the transformer architecture to vision, achieving state-of-the-art (SOTA) performance across a range of visual tasks. The proposed model was evaluated on the publicly available HyperKvasir dataset, which contains 10,662 images spanning 23 different GI findings. The proposed ensemble combines the predictions of two pre-trained models, MobileViT_XS and MobileViT_V2_200, which individually achieve accuracies of 90.57% and 90.48%, respectively. The ensemble model, GastroViT, outperforms all individual models, achieving an average precision, recall, F1 score, and accuracy of 69%, 63%, 64%, and 91.98%, respectively, in the first evaluation, which involves all 23 classes. GastroViT achieves this with only 20 million (M) parameters, without data augmentation, and despite the dataset being highly imbalanced. In the second evaluation, restricted to 16 classes, the scores are higher still: average precision, recall, F1 score, and accuracy of 87%, 86%, 87%, and 92.70%, respectively. Additionally, incorporating explainable AI (XAI) methods, namely Grad-CAM (Gradient-weighted Class Activation Mapping) and SHAP (SHapley Additive exPlanations), enhances model interpretability and provides valuable insight for reliable GI diagnosis in real-world settings.
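The abstract states that GastroViT combines the predictions of MobileViT_XS and MobileViT_V2_200 but does not specify the fusion rule. A common choice for such ensembles is soft voting, i.e. averaging the two models' class-probability vectors and taking the argmax. The sketch below illustrates that idea with NumPy on dummy logits; the function names and the equal-weight averaging are assumptions for illustration, not the paper's confirmed method.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (class) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Soft-voting ensemble (assumed fusion rule): average the two
    models' class probabilities, then take the argmax per image."""
    probs = (softmax(logits_a) + softmax(logits_b)) / 2.0
    return probs.argmax(axis=-1)

# Toy example: 4 images, 23 classes (matching the first evaluation).
rng = np.random.default_rng(0)
logits_xs = rng.normal(size=(4, 23))    # stand-in for MobileViT_XS outputs
logits_v2 = rng.normal(size=(4, 23))    # stand-in for MobileViT_V2_200 outputs
preds = ensemble_predict(logits_xs, logits_v2)
print(preds.shape)  # one class index per image
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which often explains why such ensembles beat each member individually.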