Video Question Answering (VideoQA) is a challenging video understanding task, since it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicate hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which is ignored by most existing methods. This lack of understanding of the dynamic relationships and interactions among objects poses a great challenge to the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features. Then a graph-based relation encoder is used to extract the static relationships between visual objects. To capture the dynamic changes of multimodal objects across video frames, we consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the experimental results demonstrate that our RHA outperforms the state-of-the-art methods.
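The abstract names two concrete components: a graph-based relation encoder over detected objects and a question-guided hierarchical attention fusion (first over objects within a frame, then over frames). The PyTorch sketch below illustrates roughly how such components could be wired together; the class names, tensor shapes, and attention formulations are assumptions for illustration only, not the paper's actual RHA implementation.

```python
import torch
import torch.nn as nn


class GraphRelationEncoder(nn.Module):
    """Illustrative sketch: each object attends to every other object in the
    same frame, a simplified stand-in for a graph-based static relation encoder."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, objects):                      # objects: (B, T, N, D)
        q, k, v = self.q(objects), self.k(objects), self.v(objects)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.size(-1) ** 0.5, dim=-1)
        return objects + attn @ v                    # relation-aware object features


class HierarchicalAttentionFusion(nn.Module):
    """Illustrative sketch: question-guided pooling over objects within each
    frame, then over frames, followed by multimodal fusion for answer scoring."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.obj_attn = nn.Linear(dim, 1)
        self.frame_attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim * 2, num_answers)

    def forward(self, objects, question):            # objects: (B,T,N,D), question: (B,D)
        q = question[:, None, None, :]               # broadcast question over frames/objects
        obj_w = torch.softmax(self.obj_attn(objects * q), dim=2)
        frames = (obj_w * objects).sum(dim=2)        # (B, T, D): object-level pooling
        frm_w = torch.softmax(self.frame_attn(frames * question[:, None, :]), dim=1)
        video = (frm_w * frames).sum(dim=1)          # (B, D): frame-level pooling
        return self.classifier(torch.cat([video, question], dim=-1))


# Toy usage with random tensors standing in for pre-trained visual/textual embeddings.
B, T, N, D = 2, 8, 5, 256                            # batch, frames, objects per frame, dim
objects = torch.randn(B, T, N, D)
question = torch.randn(B, D)
encoder = GraphRelationEncoder(D)
fusion = HierarchicalAttentionFusion(D, num_answers=1000)
logits = fusion(encoder(objects), question)          # (B, 1000) answer scores
```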