This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module which encodes video by explicitly capturing visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers to compare the relevance between video and text for QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and QA solution, we show that VGT achieves much better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning, in the pretraining-free scenario. Its performance even surpasses that of models pretrained with millions of external data. We further show that VGT also benefits substantially from self-supervised cross-modal pretraining, yet with orders of magnitude smaller data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT.
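To make the QA-as-relevance-comparison idea concrete, below is a minimal sketch (not the official VGT code from the repository above) of scoring candidate answers by the similarity between a video representation and each answer's text representation. The module names, dimensions, and pooling choices are illustrative assumptions; the video encoder here is a plain Transformer standing in for the dynamic graph transformer module.

```python
# Minimal sketch, assuming generic Transformer encoders in place of VGT's
# actual modules; all names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceQA(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stand-in for the dynamic graph transformer on the video side.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        # Stand-in for the text Transformer on the question/answer side.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, video_tokens, answer_tokens):
        # video_tokens:  (B, Tv, dim)      object/frame features
        # answer_tokens: (B, A, Ta, dim)   token features for A candidate answers
        v = self.video_encoder(video_tokens).mean(dim=1)          # (B, dim)
        B, A, Ta, D = answer_tokens.shape
        t = self.text_encoder(answer_tokens.view(B * A, Ta, D))
        t = t.mean(dim=1).view(B, A, D)                           # (B, A, dim)
        # Score each candidate by normalized dot-product similarity with the
        # video; the highest-scoring candidate is taken as the answer.
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        return torch.einsum('bd,bad->ba', v, t)                   # (B, A)

model = RelevanceQA()
scores = model(torch.randn(2, 16, 512), torch.randn(2, 5, 8, 512))
print(scores.argmax(dim=-1))  # predicted answer index per example
```

In contrast to an entangled cross-modal Transformer with a classification head, the two encoders here never attend to each other; relevance is computed only at the representation level, which is the disentangled comparison the abstract describes.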