Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities: integrating video processing with language understanding, binding abstract linguistic concepts to concrete visual artifacts, and reasoning deliberatively over spacetime. Neural networks offer a promising approach to reaching this potential by learning from examples rather than relying on handcrafted features and rules. However, neural networks are predominantly feature-based: they map data to unstructured vectorial representations and thus can fall into the trap of exploiting shortcuts through surface statistics instead of the systematic reasoning seen in symbolic systems. To tackle this issue, we advocate object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition and high-level symbolic algebra. To this end, we propose a new query-guided representation framework that turns a video into an evolving relational graph of objects, whose features and interactions are dynamically and conditionally inferred. The object lives are then summarized into resumes, lending themselves naturally to deliberative relational reasoning that produces an answer to the query. The framework is evaluated on major Video QA datasets, demonstrating the clear benefits of the object-centric approach to video reasoning.
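To make the described pipeline concrete, below is a minimal PyTorch sketch of the kind of object-centric Video QA model the abstract outlines: per-frame object features form a query-conditioned relational graph, each object's life is summarized into a resume vector, and a reasoning head maps the resumes and query to answer logits. All module names, dimensions, and the choices of a bilinear edge scorer and GRU summarizer are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch (not the authors' implementation) of an object-centric
# Video QA pipeline: object features per frame are refined via a
# query-conditioned relational graph, summarized over time into "resumes",
# then pooled and classified into an answer. Dimensions are assumptions.

import torch
import torch.nn as nn

class ObjectCentricVideoQA(nn.Module):
    def __init__(self, obj_dim=256, query_dim=256, n_answers=1000):
        super().__init__()
        # Scores pairwise object interactions (edges of the relational graph).
        self.edge_scorer = nn.Bilinear(obj_dim, obj_dim, 1)
        # Projects the query so it can modulate object features.
        self.query_proj = nn.Linear(query_dim, obj_dim)
        # GRU summarizes each object's life across frames into a resume.
        self.summarizer = nn.GRU(obj_dim, obj_dim, batch_first=True)
        self.classifier = nn.Linear(obj_dim * 2, n_answers)

    def forward(self, obj_feats, query):
        # obj_feats: (T, N, D) features of N objects over T frames
        # query: (D,) encoded question vector
        T, N, D = obj_feats.shape
        q = self.query_proj(query)
        refined = []
        for t in range(T):
            x = obj_feats[t] * q  # query-modulated object features
            # Dense relational graph over the N objects at frame t.
            a = self.edge_scorer(
                x.repeat_interleave(N, 0), x.repeat(N, 1)
            ).view(N, N).softmax(-1)
            refined.append(a @ x)  # one message-passing step
        lives = torch.stack(refined, dim=1)   # (N, T, D) object lives
        _, resumes = self.summarizer(lives)   # (1, N, D) object resumes
        pooled = resumes.squeeze(0).mean(0)   # aggregate across objects
        return self.classifier(torch.cat([pooled, q]))  # answer logits

# Usage: 8 frames, 5 detected objects, 256-d features and query.
model = ObjectCentricVideoQA()
logits = model(torch.randn(8, 5, 256), torch.randn(256))
```

The sketch conditions both the node features and the resulting edge weights on the query, matching the abstract's claim that object features and interactions are "dynamically and conditionally inferred"; a real system would add an object detector front end and a more elaborate reasoning module.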