High-level understanding of stories in video, such as movies and TV shows, from raw data alone is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources such as plot synopses, scripts, video descriptions, or knowledge bases. In this work, we present a new approach to understanding the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into a text description via dialog summarization, much as recent methods treat video. Each modality is encoded independently by a transformer, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotations or human-made plot summaries. It even outperforms human evaluators who have never watched a whole episode before.
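To make the fusion step concrete, the sketch below illustrates one plausible reading of "soft temporal attention for localization over long inputs": each modality's transformer output is pooled with question-conditioned attention weights over time, and the pooled vectors are concatenated and scored against the answer candidates. This is a minimal sketch under assumed shapes and module names (SoftTemporalFusion, attend, the 768-d embeddings, the 4-way answer setup), not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SoftTemporalFusion(nn.Module):
    """Hypothetical sketch: fuse per-modality transformer outputs with
    soft temporal attention conditioned on a pooled question embedding."""

    def __init__(self, dim: int, num_modalities: int, num_answers: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-timestep attention logits
        self.classifier = nn.Linear(dim * num_modalities, num_answers)

    def attend(self, seq: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, D) transformer outputs for one modality over a long input
        # question: (B, D) pooled question embedding
        logits = self.score(seq + question.unsqueeze(1)).squeeze(-1)  # (B, T)
        weights = logits.softmax(dim=-1)          # soft temporal localization
        return (weights.unsqueeze(-1) * seq).sum(dim=1)  # (B, D)

    def forward(self, modalities: list[torch.Tensor], question: torch.Tensor):
        pooled = [self.attend(m, question) for m in modalities]  # one vector per modality
        return self.classifier(torch.cat(pooled, dim=-1))        # answer scores

# Example with hypothetical shapes: three modalities, 4-way multiple choice
model = SoftTemporalFusion(dim=768, num_modalities=3, num_answers=4)
video, dialog, knowledge = (torch.randn(2, t, 768) for t in (120, 200, 60))
question = torch.randn(2, 768)
scores = model([video, dialog, knowledge], question)  # (2, 4) answer logits
```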