In this work for Capsule Vision Challenge 2024, we addressed the challenge of multiclass anomaly classification in video capsule Endoscopy (VCE)[1] with a variety of deep learning models, ranging from custom CNNs to advanced transformer architectures. The purpose is to correctly classify diverse gastrointestinal disorders, which is critical for increasing diagnostic efficiency in clinical settings. We started with a baseline CNN model and improved performance with ResNet[2] for better feature extraction, followed by Vision Transformer (ViT)[3] to capture global dependencies. We further improve the results by using Multiscale Vision Transformer (MViT)[4] for improved hierarchical feature extraction, while Dual Attention Vision Transformer (DaViT) [5] delivered best results by combining spatial and channel attention methods. Our best balanced accuracy on validation set [6] was 0.8592 and Mean AUC was 0.9932. This methodology enabled us to improve model accuracy across a wide range of criteria, greatly surpassing all other methods.Additionally, our team capsule commandos achieved 7th place ranking with a test set[7] performance of Mean AUC: 0.7314 and balanced accuracy: 0.3235
翻译:暂无翻译