Video question answering (VideoQA) is challenging because it combines visual understanding and natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and the interaction between the question and the visual information for extracting textual semantics is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, with residual connections adopted for information passing across different levels. Through extensive experiments on three VideoQA datasets, we demonstrate that the proposed method outperforms state-of-the-art approaches.
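To make the described multimodal attention concrete, the following is a minimal sketch, not the authors' released code, of one question-video cross-attention block with a residual connection. All dimensions, the class name, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a cross-modal attention block with a residual
# connection, loosely following the abstract's description; not the TPT code.
import torch
import torch.nn as nn

class QuestionVideoAttention(nn.Module):
    """Word features attend over frame features from one temporal-pyramid level."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # question: (batch, num_words, dim); video: (batch, num_frames, dim)
        attended, _ = self.attn(query=question, key=video, value=video)
        # Residual connection carries information across levels before normalization.
        return self.norm(question + attended)

# Usage: fuse word embeddings with frame features at one temporal scale.
words = torch.randn(2, 12, 512)    # hypothetical word features
frames = torch.randn(2, 32, 512)   # hypothetical frame features
fused = QuestionVideoAttention()(words, frames)  # shape: (2, 12, 512)
```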