Building a system that can hold a meaningful conversation with humans about what they watch would be a notable technological feat. A step toward that goal is the video dialog task, in which a system must generate natural utterances in response to questions posed during an ongoing dialog. The task presents substantial visual, linguistic, and reasoning challenges that cannot easily be overcome without a representation of video and dialog that supports high-level reasoning. To tackle these challenges, we present COST (Conversation about Objects in Space-Time), a new object-centric framework for video dialog that supports neural reasoning. The dynamic space-time visual content of a video is first parsed into object trajectories. Given this video abstraction, COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions. Object interactions are dynamically and conditionally inferred for each question, and these interactions serve as the basis for relational reasoning among objects. COST also maintains a history of previous answers, allowing retrieval of relevant object-centric information to enrich the answer-forming process. Language production then proceeds step by step, taking into account the context of the current utterance, the existing dialog, and the current question. We evaluate COST on the DSTC7 and DSTC8 benchmarks, demonstrating its competitiveness against state-of-the-art methods.
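The pipeline described above — per-object dialog states updated at each turn, with question-conditioned attention pooling an answer context — can be sketched in a minimal form. This is an illustrative sketch only: the class names, gating rule, and feature dimensions are assumptions for exposition, not the authors' actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

class ObjectDialogState:
    """Hypothetical per-object state, updated as the dialog progresses."""
    def __init__(self, trajectory_feat):
        self.trajectory = trajectory_feat              # space-time object feature
        self.state = np.zeros_like(trajectory_feat)    # dialog-conditioned state

    def update(self, question_emb):
        # Assumed gating: how strongly the new question engages this object.
        gate = 1.0 / (1.0 + np.exp(-(self.trajectory @ question_emb)))
        self.state = gate * question_emb + (1.0 - gate) * self.state

def answer_context(objects, question_emb):
    """Pool object states with question-conditioned attention weights."""
    scores = np.array([o.state @ question_emb for o in objects])
    weights = softmax(scores)
    return sum(w * o.state for w, o in zip(weights, objects))

# Usage: three detected object trajectories, two dialog turns, then pooling.
objects = [ObjectDialogState(rng.standard_normal(8)) for _ in range(3)]
for question in [rng.standard_normal(8), rng.standard_normal(8)]:
    for obj in objects:
        obj.update(question)

ctx = answer_context(objects, rng.standard_normal(8))
print(ctx.shape)  # an 8-dimensional context vector for the answer decoder
```

In the real framework, the gating and attention would be learned modules, and the pooled context would feed a language decoder together with the dialog history; the sketch only mirrors the data flow of tracking and querying object-associated states.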