Video captioning aims to automatically generate natural language descriptions of video content, and has drawn considerable attention in recent years. Generating accurate and fine-grained captions requires not only understanding the global content of a video, but also capturing detailed object information. Video representations also have a great impact on the quality of the generated captions. It is therefore important for video captioning to capture salient objects together with their detailed temporal dynamics, and to represent them using discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for salient objects in video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: A bidirectional temporal graph is constructed both along and against the temporal order, which provides complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: Learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on the object temporal trajectories and the global frame sequence, and perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely-used datasets demonstrate that our OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
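To make the object-aware aggregation step more concrete, the following is a minimal sketch of one common formulation of learnable VLAD, using soft assignment of local descriptors to cluster centers. The abstract does not specify the exact form used in OA-BTG, so the function name `soft_vlad`, its parameters, and the normalization steps are assumptions made for illustration only.

```python
import numpy as np


def soft_vlad(descriptors, centers, weights, bias):
    """Aggregate N local descriptors of shape (N, D) into one (K*D,) vector.

    Hypothetical sketch: `centers` (K, D), `weights` (D, K) and `bias` (K,)
    stand in for parameters that would be learned end to end.
    """
    # Soft assignment of each descriptor to each of the K centers.
    logits = descriptors @ weights + bias                 # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)   # rows sum to 1

    # Residual of every descriptor to every center, weighted by assignment.
    residuals = descriptors[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)         # (K, D)

    # Normalize per center, then flatten and L2-normalize the whole vector.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

Under these assumptions, `descriptors` could hold the region features along one object's temporal trajectory, or the frame features of the global sequence, so that each trajectory yields one fixed-length representation regardless of its length.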