Reasoning about causal and temporal event relations in videos is an emerging goal of Video Question Answering (VideoQA). The major stumbling block to achieving this goal is the semantic gap between language and video, since the two modalities sit at different levels of abstraction. Existing efforts mainly focus on designing sophisticated architectures while relying on frame- or object-level visual representations. In this paper, we reconsider the multi-modal alignment problem in VideoQA from the feature and sample perspectives to achieve better performance. From the feature perspective, we decompose the video into trajectories and are the first to leverage trajectory features in VideoQA to strengthen the alignment between the two modalities. Moreover, we adopt a heterogeneous graph architecture and design a hierarchical framework that aligns both trajectory-level and frame-level visual features with language features. In addition, we find that VideoQA models rely heavily on language priors and often neglect visual-language interactions. Therefore, from the sample perspective, we design two effective yet portable training augmentation strategies to strengthen the cross-modal correspondence ability of our model. Extensive experiments show that our method outperforms all state-of-the-art models on the challenging NExT-QA benchmark, demonstrating the effectiveness of the proposed approach.