This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. Toward this goal, we created the MedVidCL and MedVidQA datasets and introduce the tasks of Medical Video Classification (MVC) and Medical Visual Answer Localization (MVAL), two tasks that focus on cross-modal (medical language and medical video) understanding. The proposed tasks and datasets have the potential to support the development of sophisticated downstream applications that can benefit the public and medical practitioners. Our datasets consist of 6,117 annotated videos for the MVC task and 3,010 annotated questions with answer timestamps from 899 videos for the MVAL task. These datasets have been verified and corrected by medical informatics experts. We have also benchmarked each task with the created MedVidCL and MedVidQA datasets and propose multimodal learning methods that set competitive baselines for future research.