Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and are often overloaded with clinical and academic work. This overload limits the time they can spend answering questions from patients, medical students, or junior residents about surgical procedures. Students and junior residents also sometimes refrain from asking too many questions during classes to avoid disruption. Although computer-aided simulators and recordings of past surgical procedures are available for them to observe and improve their skills, they still rely heavily on medical experts to answer their questions. A Surgical-VQA system that serves as a reliable 'second opinion' could act as a backup and ease the load on medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questions about surgical procedures based on the surgical scene. Extending the MICCAI Endoscopic Vision Challenge 2018 dataset and a workflow recognition dataset, we introduce two Surgical-VQA datasets with classification-based and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and of temporal visual features on model performance in both classification-based and sentence-based answering.