Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem is nevertheless not trivial, especially when a video contains multiple events worth mentioning, as is often the case in real videos. A natural question is how to temporally localize events and then describe them, a task known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and the sentence generation for each proposal by jointly training them in an end-to-end manner. To bridge these two tasks, we integrate a new design, namely descriptiveness regression, into a single-shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation; this in turn adjusts the temporal location of each event proposal. Our model differs from existing dense video captioning methods in that we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on the ActivityNet Captions dataset, and our framework shows clear improvements over state-of-the-art techniques. More remarkably, we obtain a new record: a METEOR of 12.96% on the official ActivityNet Captions test set.
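To make the idea of descriptiveness regression concrete, the sketch below shows one plausible way a single-shot temporal detection head could predict, per anchor, an eventness score, boundary offsets, and a descriptiveness score, with the latter used to re-rank proposals. This is a minimal illustration, not the authors' implementation: the module name TemporalSSDHead, the choice of PyTorch, and all layer sizes and targets are assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a 1D single-shot detection
# head augmented with a descriptiveness-regression branch.
import torch
import torch.nn as nn

class TemporalSSDHead(nn.Module):
    """Per-anchor eventness score, boundary offsets, and descriptiveness score."""
    def __init__(self, in_channels: int, num_anchors: int):
        super().__init__()
        # Eventness: is there an event centered at this anchor?
        self.cls = nn.Conv1d(in_channels, num_anchors, kernel_size=3, padding=1)
        # Offsets: (center shift, length scale) per anchor.
        self.loc = nn.Conv1d(in_channels, num_anchors * 2, kernel_size=3, padding=1)
        # Descriptiveness: how "caption-worthy" the segment is; in training this
        # branch would be regressed against a target derived from the captioner
        # (e.g., sentence-generation confidence) -- an assumption here.
        self.desc = nn.Conv1d(in_channels, num_anchors, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, channels, temporal_len) pooled clip features
        cls = self.cls(feats)                    # (B, A, T)
        loc = self.loc(feats)                    # (B, 2A, T)
        desc = torch.sigmoid(self.desc(feats))   # (B, A, T), in [0, 1]
        return cls, loc, desc

if __name__ == "__main__":
    head = TemporalSSDHead(in_channels=512, num_anchors=3)
    feats = torch.randn(1, 512, 64)
    cls, loc, desc = head(feats)
    # Combine eventness and descriptiveness so segments that are easier to
    # describe are ranked higher before captioning.
    score = torch.sigmoid(cls) * desc
    print(score.shape)  # torch.Size([1, 3, 64])
```

In this hypothetical setup, the descriptiveness branch is what couples detection to captioning: its regression target comes from the sentence-generation side, so proposal locations and scores are influenced by how well each segment can be described.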