Accurate localization of the fovea is a crucial first step in analyzing retinal diseases, since it helps prevent irreversible vision loss. Although current deep learning-based methods outperform traditional methods, they still face challenges such as inadequate use of anatomical landmarks and sensitivity to diseased retinal images and varying image conditions. In this paper, we propose a novel transformer-based architecture (Bilateral-Fuser) for multi-cue fusion. The Bilateral-Fuser explicitly incorporates long-range connections and global features using retina and vessel distributions to achieve robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information. This design focuses more on features distributed along blood vessels and significantly reduces computational cost by reducing the number of tokens. Our comprehensive experiments demonstrate that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Moreover, we show that the Bilateral-Fuser is more robust on both normal and diseased retinal images and generalizes better in cross-dataset experiments.