Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

In previous work, we proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advances toward real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task, which includes temporal reasoning, and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.

