Accurate localization of the fovea is one of the primary steps in analyzing retinal diseases, since it helps prevent irreversible vision loss. Although current deep learning-based methods outperform traditional methods, challenges remain, such as insufficient use of anatomical landmarks and sensitivity to diseased retinas and varying imaging conditions. In this paper, we propose a novel transformer-based architecture (Bilateral-Fuser) for multi-cue fusion. The architecture explicitly incorporates long-range connections and global features from retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information. This design focuses on features distributed along blood vessels and significantly decreases computational cost by reducing the number of tokens. Comprehensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. We also show that the Bilateral-Fuser is more robust on both normal and diseased retinal images and generalizes better in cross-dataset experiments.
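To make the dual-stream, token-reducing design concrete, the following is a minimal PyTorch sketch of the general idea: two convolutional stems encode the retina image and its vessel map, a spatial-attention score keeps only the highest-scoring tokens from each stream, and a transformer encoder fuses the reduced token set before regressing the fovea coordinates. All module names, channel sizes, and the top-k reduction rule are illustrative assumptions, not the paper's exact Bilateral-Fuser implementation.

```python
# Hypothetical sketch of a dual-stream encoder with spatial attention and
# token reduction; not the authors' released code.
import torch
import torch.nn as nn


class DualStreamFuser(nn.Module):
    def __init__(self, dim=256, keep_tokens=64, heads=8, depth=4):
        super().__init__()
        # Lightweight convolutional stems for each cue (retina / vessels).
        self.retina_stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=8, padding=3), nn.GELU())
        self.vessel_stem = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=7, stride=8, padding=3), nn.GELU())
        # Spatial attention: one score per spatial location (token).
        self.retina_score = nn.Conv2d(dim, 1, kernel_size=1)
        self.vessel_score = nn.Conv2d(dim, 1, kernel_size=1)
        self.keep_tokens = keep_tokens
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 2)  # (x, y) fovea coordinates

    def _reduce(self, feat, score):
        # Flatten B x C x H x W into B x (H*W) x C tokens and keep only the
        # keep_tokens locations with the highest spatial-attention scores.
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)        # B, HW, C
        weights = score.flatten(2).transpose(1, 2)      # B, HW, 1
        tokens = tokens * torch.sigmoid(weights)        # apply spatial attention
        idx = weights.squeeze(-1).topk(self.keep_tokens, dim=1).indices
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))

    def forward(self, retina, vessels):
        r = self.retina_stem(retina)
        v = self.vessel_stem(vessels)
        r_tok = self._reduce(r, self.retina_score(r))
        v_tok = self._reduce(v, self.vessel_score(v))
        # Fuse the reduced token sets from both streams with a transformer.
        fused = self.fuser(torch.cat([r_tok, v_tok], dim=1))
        return self.head(fused.mean(dim=1))             # predicted fovea (x, y)


# Example usage with random tensors standing in for a fundus image and its
# vessel segmentation map.
model = DualStreamFuser()
xy = model(torch.randn(2, 3, 512, 512), torch.randn(2, 1, 512, 512))
print(xy.shape)  # torch.Size([2, 2])
```

The token reduction step is where the computational saving comes from: the transformer operates on a few dozen vessel-focused tokens per stream rather than every spatial location of the feature map.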