We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is achieved by additional cross-modal interaction modules. 3) It is optimized by joint fully-supervised and self-supervised contrastive objectives between the correct and incorrect answers, and between the relevant and irrelevant questions, respectively. With superior video encoding and QA solutions, we show that CoVGT achieves much better performance than previous arts on video reasoning tasks. Its performance even surpasses that of models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. These results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning over video contents. Our code is available at https://github.com/doc-doc/CoVGT.
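The contrastive QA objective described in point 3 can be illustrated with a minimal sketch: given a fused video-question representation and a set of candidate answer embeddings, an InfoNCE-style loss pulls the correct answer toward the video representation and pushes the incorrect candidates away. The function below is a simplified illustration under assumed shapes and an illustrative temperature value, not the paper's exact implementation.

```python
import numpy as np

def contrastive_qa_loss(video_emb, answer_embs, correct_idx, tau=0.07):
    """InfoNCE-style contrastive loss over answer candidates.

    video_emb:   (d,) fused video-question vector (assumed shape).
    answer_embs: (k, d) embeddings of the k candidate answers.
    correct_idx: index of the correct answer (the positive).
    tau:         temperature hyperparameter (value here is illustrative).
    """
    # Cosine similarity between the video representation and each candidate.
    v = video_emb / np.linalg.norm(video_emb)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    logits = a @ v / tau
    # Cross-entropy over candidates, with the correct answer as the positive:
    # maximizing the correct answer's similarity relative to the distractors.
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[correct_idx]
```

The same form applies to the self-supervised objective, with relevant and irrelevant questions taking the roles of positive and negative candidates.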