Medical Visual Question Answering (VQA) is an important challenge, as it would lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined, closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, consider it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited to small, domain-specific medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the main medical VQA benchmarks, namely Slake, OVQA, and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
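To make the described pipeline concrete, the short PyTorch sketch below illustrates one plausible way to map a pooled visual feature vector to a fixed number of learnable prompt tokens that are then concatenated with the embedded question before being fed to the language model. The class name VisualPromptMapper, the feature dimensions, and the number of prompt tokens are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class VisualPromptMapper(nn.Module):
        """Maps a pooled visual feature vector to a sequence of soft prompt
        tokens in the language model's embedding space (hypothetical sketch)."""
        def __init__(self, visual_dim=1024, lm_dim=768, num_tokens=8):
            super().__init__()
            self.num_tokens = num_tokens
            self.lm_dim = lm_dim
            # Simple MLP that expands one visual feature vector into
            # num_tokens embeddings of the language model's hidden size.
            self.mapper = nn.Sequential(
                nn.Linear(visual_dim, lm_dim * num_tokens),
                nn.GELU(),
                nn.Linear(lm_dim * num_tokens, lm_dim * num_tokens),
            )

        def forward(self, visual_features):
            # visual_features: (batch, visual_dim), e.g. from a frozen image encoder
            batch = visual_features.size(0)
            prompt = self.mapper(visual_features)
            return prompt.view(batch, self.num_tokens, self.lm_dim)

    # Usage sketch: the prompt tokens are concatenated with the embedded
    # question and passed to a (parameter-efficiently fine-tuned) language model.
    mapper = VisualPromptMapper()
    visual_features = torch.randn(2, 1024)      # stand-in for image-encoder output
    prompt_tokens = mapper(visual_features)     # (2, 8, 768)
    question_embeds = torch.randn(2, 16, 768)   # stand-in for embedded question tokens
    lm_inputs = torch.cat([prompt_tokens, question_embeds], dim=1)

In this sketch, only the mapping network (and any parameter-efficient adapter weights in the language model) would be trained, which is consistent with the resource- and data-efficient fine-tuning setting described above.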