Vision Transformer for femur fracture classification

In recent years, the scientific community has focused on developing computer-aided diagnosis (CAD) tools to improve bone fracture classification, mostly based on Convolutional Neural Networks (CNNs). However, accuracy in discriminating fracture subtypes has remained far from optimal. This paper proposes a modified version of a recent and powerful deep learning technique, the Vision Transformer (ViT), which outperforms CNN-based approaches and consequently increases specialists' diagnostic accuracy. A total of 4207 manually annotated images were used, divided into fracture types according to the AO/OTA classification; this is the largest labeled dataset of proximal femur fractures used in the literature. The ViT architecture was compared with a classic CNN and with a multistage architecture composed of successive CNNs in cascade. To demonstrate the reliability of this approach, (1) attention maps were used to visualize the most relevant areas of the images, (2) the performance of a generic CNN and the ViT was compared through unsupervised learning techniques, and (3) 11 specialists were asked to evaluate and classify 150 images of proximal femur fractures with and without the help of the ViT, and the results were compared to assess potential improvement. The ViT correctly predicted 83% of the test images. Precision, recall and F1-score were 0.77 (CI 0.64-0.90), 0.76 (CI 0.62-0.91) and 0.77 (CI 0.64-0.89), respectively. On average, specialists' diagnostic performance improved by 29% when supported by the ViT's predictions, outperforming the algorithm alone. This paper showed the potential of Vision Transformers in bone fracture classification. For the first time, good results were obtained in fracture subtype classification, using the largest and richest dataset to date. Accordingly, assisted diagnosis yielded the best results, confirming the effectiveness of coordinated work between neural networks and specialists.
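As a rough illustration of how the accuracy, precision, recall and F1-score quoted above can be computed for a multi-class classifier, here is a minimal Python sketch. It assumes macro-averaging over classes, which the abstract does not specify. The function name `macro_scores`, the class labels and the toy predictions are hypothetical and are not taken from the paper's dataset or code.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Assumption: scores are macro-averaged, i.e. computed per class and
    then averaged with equal weight per class.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives and false negatives per class.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        prec = tp[c] / predicted if predicted else 0.0
        rec = tp[c] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with placeholder class names (not the paper's data).
y_true = ["A1", "A1", "A2", "A2", "B1", "B1"]
y_pred = ["A1", "A2", "A2", "A2", "B1", "A1"]
acc, prec, rec, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Under a different averaging scheme (for example weighting each class by its number of images) the same predictions would give different values, so the averaging choice matters when comparing against reported figures.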
