Writing reports by analyzing medical images is error-prone for inexperienced practitioners and time-consuming for experienced ones. In this work, we present RepsNet, which adapts pre-trained vision and language models to interpret medical images and generate automated reports in natural language. RepsNet consists of an encoder-decoder model: the encoder aligns images with natural language descriptions via contrastive learning, while the decoder predicts answers by conditioning on the encoded images and the prior context of descriptions retrieved by nearest neighbor search. We formulate the problem in a visual question answering setting to handle both categorical and descriptive natural language answers. We perform experiments on two challenging tasks on radiology image datasets: medical visual question answering (VQA-Rad) and report generation (IU-Xray). Results show that RepsNet outperforms state-of-the-art methods with 81.08% classification accuracy on VQA-Rad 2018 and a 0.58 BLEU-1 score on IU-Xray. Supplementary details are available at https://sites.google.com/view/repsnet.
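To make the encoder's alignment objective concrete, the sketch below shows a standard symmetric image-text contrastive (InfoNCE-style) loss of the kind the abstract describes; the function name, dimensions, and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): CLIP-style bidirectional
# contrastive loss that pulls matched image/report embeddings together and
# pushes mismatched pairs apart within a batch.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of the vision and language
    encoders; row i of each tensor forms a positive pair, all other rows
    in the batch serve as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) cosine-similarity logits, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At generation time, the same embedding space can serve the retrieval step the abstract mentions: nearest neighbor search over stored report embeddings returns prior descriptions that the decoder conditions on alongside the encoded image.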