Action in video usually involves the interaction of humans with objects. Action labels are typically composed of various combinations of verbs and nouns, but we may not have training data for all possible combinations. In this paper, we aim to improve the generalization ability of compositional action recognition models to novel verbs or novel nouns that are unseen during training, by leveraging the power of knowledge graphs. Previous work utilizes verb-noun compositional action nodes in the knowledge graph, which scales inefficiently since the number of compositional action nodes grows quadratically with the number of verbs and nouns. To address this issue, we propose Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions. DARK trains a factorized model by first extracting disentangled feature representations for verbs and nouns, and then predicting classification weights using relations in external knowledge graphs. A type constraint between verbs and nouns is extracted from external knowledge bases and applied when composing actions. DARK scales better in the number of objects and verbs, and achieves state-of-the-art performance on the Charades dataset. We further propose a new benchmark split based on the EPIC-Kitchens dataset, which is an order of magnitude larger in the number of classes and samples, and benchmark several models on it.
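To make the factorization concrete, below is a minimal sketch of how disentangled verb and noun predictions can be composed into action scores under a knowledge-base type constraint. All names are ours for illustration, not the paper's implementation: we assume per-branch logits and a binary verb-noun compatibility mask mined from an external knowledge base.

```python
import torch

def compose_action_scores(verb_logits, noun_logits, type_mask):
    """Hypothetical composition step.

    verb_logits: (B, V) logits from the verb branch.
    noun_logits: (B, N) logits from the noun branch.
    type_mask:   (V, N) binary matrix; 1 where the verb-noun pair is
                 plausible according to the external knowledge base.
    Returns:     (B, V, N) scores over all verb-noun compositions.
    """
    verb_prob = verb_logits.softmax(dim=-1)  # (B, V)
    noun_prob = noun_logits.softmax(dim=-1)  # (B, N)
    # Outer product scores every verb-noun pair, so V x N actions are
    # covered while only V + N classifiers are learned.
    pair_scores = verb_prob.unsqueeze(2) * noun_prob.unsqueeze(1)  # (B, V, N)
    # The type constraint zeroes out implausible compositions.
    return pair_scores * type_mask

# Toy usage: 3 verbs, 4 nouns, batch of 2.
mask = torch.tensor([[1., 1., 0., 0.],
                     [0., 1., 1., 0.],
                     [1., 0., 0., 1.]])
scores = compose_action_scores(torch.randn(2, 3), torch.randn(2, 4), mask)
print(scores.shape)  # torch.Size([2, 3, 4])
```

Because the classifiers are factorized, adding a new verb or noun only adds one row or column to the score matrix, rather than a new classifier per composition.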