Automatic surgical instruction generation is a prerequisite for intra-operative context-aware surgical assistance. However, generating instructions from surgical scenes is challenging, as it requires jointly understanding the surgical activity of the current view and modelling the relationships between visual information and textual description. Inspired by neural machine translation and image captioning in the open domain, we introduce a transformer-backboned encoder-decoder network with self-critical reinforcement learning to generate instructions from surgical images. We evaluate the effectiveness of our method on the DAISI dataset, which includes 290 procedures from various medical disciplines. Our approach outperforms the existing baseline across all caption evaluation metrics. The results demonstrate the benefits of the transformer-backboned encoder-decoder structure in handling multimodal context.
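The abstract pairs a transformer encoder-decoder over image features with self-critical reinforcement learning. Below is a minimal PyTorch sketch of both ideas; the module name `InstructionGenerator`, all hyperparameters, and the assumption of pre-extracted CNN region features are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class InstructionGenerator(nn.Module):
    """Minimal transformer encoder-decoder over pre-extracted image features.

    Hypothetical sketch: feature_dim, vocab_size, and layer counts are
    illustrative assumptions, not the paper's actual settings.
    """
    def __init__(self, feature_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=3, max_len=50):
        super().__init__()
        self.proj = nn.Linear(feature_dim, d_model)       # map CNN features to model width
        self.embed = nn.Embedding(vocab_size, d_model)    # instruction token embeddings
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # learned positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, N, feature_dim) visual features of the surgical image
        # tokens: (B, T) instruction tokens, shifted right for teacher forcing
        B, T = tokens.shape
        memory_in = self.proj(feats)
        tgt = self.embed(tokens) + self.pos[:T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.transformer(memory_in, tgt, tgt_mask=causal)
        return self.out(h)  # (B, T, vocab_size) logits


def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical REINFORCE loss (Rennie et al., 2017).

    sample_log_probs: (B, T) log-probabilities of a sampled instruction
    sample_reward / greedy_reward: (B,) caption-metric rewards (e.g. CIDEr)
    for the sampled rollout and the model's own greedy decode.
    """
    advantage = sample_reward - greedy_reward                    # greedy decode as baseline
    return -(advantage.unsqueeze(1) * sample_log_probs).mean()  # REINFORCE with baseline
```

In self-critical training the greedy decode's reward serves as the baseline, so sampled sequences that beat the model's own test-time output receive positive advantage; no learned value function is needed.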