Radiology report generation (RRG) aims to automatically describe a radiology image in human-like language and could potentially support the work of radiologists, reducing the burden of manual reporting. Previous approaches often adopt an encoder-decoder architecture and focus on single-modal feature learning, while few studies explore cross-modal feature interaction. Here we propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal pattern learning and exploit it to improve radiology report generation. This is achieved by three well-designed, fully differentiable and complementary modules: a shared cross-modal prototype matrix to record the cross-modal prototypes; a cross-modal prototype network to learn the cross-modal prototypes and embed the cross-modal information into the visual and textual features; and an improved multi-label contrastive loss to enable and enhance multi-label prototype learning. XPRONET obtains substantial improvements on the IU-Xray and MIMIC-CXR benchmarks, exceeding recent state-of-the-art approaches by a large margin on IU-Xray and achieving comparable performance on MIMIC-CXR.
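To make the shared-prototype idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes a single learnable prototype matrix queried by both visual and textual features via cosine-similarity attention, with the prototype responses added back into each modality's features. The module name, tensor shapes, and fusion scheme are illustrative assumptions.

```python
# Hypothetical sketch of a shared cross-modal prototype matrix; not the
# authors' code. Both modalities attend over the same prototype bank, and the
# weighted prototype response is embedded back into the input features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalPrototypeSketch(nn.Module):
    def __init__(self, num_prototypes: int = 128, dim: int = 512):
        super().__init__()
        # Shared cross-modal prototype matrix (assumed flat bank of prototypes).
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def respond(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim) features from either modality.
        # Cosine-similarity attention over the shared prototypes.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.prototypes, dim=-1).t()
        weights = sim.softmax(dim=-1)          # (batch, seq_len, num_prototypes)
        response = weights @ self.prototypes   # prototype-weighted summary
        return feats + response                # embed cross-modal info into the features

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor):
        # Both modalities query the same matrix, which is what makes it cross-modal.
        return self.respond(visual_feats), self.respond(text_feats)

# Example usage with made-up shapes (49 visual patches, 60 report tokens):
# v, t = CrossModalPrototypeSketch()(torch.randn(2, 49, 512), torch.randn(2, 60, 512))
```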