Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

Medical image captioning automatically generates a medical description of the content of a given medical image. A traditional medical image captioning model generates this description from a single medical image input alone, which makes it hard to produce an abstract medical description or concept and limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one approach to this problem. In multi-modal medical image captioning, textual input, e.g., expert-defined keywords, is treated as one of the main drivers of medical description generation. Encoding both the textual input and the medical image effectively is therefore important for the task. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Evaluated on the existing multi-modal medical image captioning dataset, the proposed model is effective, with increases of +53.2% in BLEU-avg and +18.6% in CIDEr compared with the state-of-the-art method.
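
Of the three components named above, masked self-attention is the easiest to illustrate in isolation. The following is a minimal sketch, not the proposed model: it assumes a causal mask (each position attends only to itself and earlier positions), uses the input vectors directly as queries, keys, and values with no learned projections, and the function names `masked_self_attention` and `causal_mask` are hypothetical. The abstract does not specify the mask or the projections actually used.

```python
import math


def softmax(scores):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    # Assumes at least one finite score in the row.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(n):
    # Assumption: position i may attend to positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over a list of n vectors.

    x: list of n vectors, each a list of d floats.
    mask[i][j] is True when position i may attend to position j.
    Queries, keys, and values are all x itself (illustrative simplification).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if mask[i][j]:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
            else:
                # Masked positions are excluded before the softmax.
                scores.append(float("-inf"))
        weights = softmax(scores)
        # Output i is the attention-weighted sum of the value vectors.
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = masked_self_attention(tokens, causal_mask(len(tokens)))
```

Under this mask, the first output equals the first input vector, and changing a later vector leaves the outputs at earlier positions unchanged.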
