Temporal grounding is the task of localizing the segment of an untrimmed video that a query sentence describes. The task has gained significant momentum in the computer vision community because it enables activity grounding beyond pre-defined activity classes by exploiting the semantic diversity of natural language descriptions. This diversity is rooted in the linguistic principle of compositionality, whereby novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not carefully designed to evaluate compositional generalizability. To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating state-of-the-art methods on these splits, we empirically find that they fail to generalize to queries with novel combinations of seen words. We argue that the inherent structured semantics within video and language are the crucial factor in achieving compositional generalization. Based on this insight, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into separate hierarchical semantic graphs and learns fine-grained semantic correspondence between the two graphs. Furthermore, we introduce a novel adaptive structured semantics learning approach that derives structure-informed and domain-generalizable graph representations, which facilitate fine-grained semantic correspondence reasoning between the two graphs. Extensive experiments validate the superior compositional generalizability of our approach.