High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often rely on additional human-made sources such as plot synopses, scripts, video descriptions, or knowledge bases. In this work, we present a new approach to understanding the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into a text description via dialog summarization, much as recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made plot summaries. It even outperforms human evaluators who have never watched a whole episode before. Code is available at https://engindeniz.github.io/dialogsummary-videoqa.
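The fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the general idea of combining independently encoded modality features via soft temporal attention, with a question embedding used as the attention query. All names, dimensions, and the dot-product scoring function are illustrative assumptions.

```python
import numpy as np

def soft_temporal_attention(query, sequence):
    # Softly attend over a (T, d) sequence with a (d,) query;
    # return the attention-weighted sum, localizing relevant time steps.
    scores = sequence @ query / np.sqrt(sequence.shape[1])
    weights = np.exp(scores - scores.max())   # stable softmax over time
    weights /= weights.sum()
    return weights @ sequence

def fuse_modalities(query, modalities):
    # Each modality arrives as an independently encoded (T, d) feature
    # sequence (standing in for transformer outputs); attend over each
    # and concatenate the localized summaries into one fused vector.
    return np.concatenate([soft_temporal_attention(query, m) for m in modalities])

rng = np.random.default_rng(0)
q = rng.normal(size=8)             # hypothetical question embedding
video = rng.normal(size=(20, 8))   # hypothetical encoded video segments
dialog = rng.normal(size=(20, 8))  # hypothetical encoded dialog-summary sentences
fused = fuse_modalities(q, [video, dialog])
assert fused.shape == (16,)        # one localized vector per modality, concatenated
```

In the sketch, the softmax weights act as the soft temporal localization: each modality contributes a single summary vector dominated by the time steps most relevant to the question.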