In this paper, we investigate a novel and challenging task: controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task is to generate a caption that not only describes the semantic content of the video but also follows the syntactic form of the exemplar. To tackle this exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) embedded in an encoder-decoder-reconstructor architecture. SMCG takes the semantic representation of the video as input and modulates the gates and cells of a long short-term memory (LSTM) network conditioned on the encoded syntactic information of the exemplar sentence. As a result, SMCG can control the states used for word prediction and generate captions with customized syntax. For evaluation, we collect auxiliary exemplar sentences for two public video captioning datasets. Extensive experimental results demonstrate the effectiveness of our approach in generating syntax-controllable and semantics-preserving video captions. Given different exemplar sentences, our approach produces different captions with varied syntactic structures, which indicates a promising way to increase the diversity of video captioning.
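To make the idea of modulating LSTM gates and cells with syntactic information more concrete, the following is a minimal sketch of one possible conditional modulation scheme. It is not the SMCG implementation: the abstract does not specify the form of the modulation, so the sketch assumes a normalize-then-scale-and-shift design in which a syntax vector `s` predicts per-unit `gamma` and `beta` values. The class name `SyntaxModulatedLSTMCell`, the weight names `W_gamma` and `W_beta`, and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and (approximately) unit variance.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SyntaxModulatedLSTMCell:
    """Hypothetical LSTM cell whose gates and cell state are modulated by a syntax vector."""

    def __init__(self, input_dim, hidden_dim, syntax_dim, seed=0):
        rng = random.Random(seed)
        self.H = hidden_dim
        # Standard LSTM weights for the input, forget, output, and candidate blocks.
        self.W = rand_matrix(rng, 4 * hidden_dim, input_dim + hidden_dim)
        # Assumed modulation heads: 4 gate blocks plus 1 block for the cell state.
        self.W_gamma = rand_matrix(rng, 5 * hidden_dim, syntax_dim)
        self.W_beta = rand_matrix(rng, 5 * hidden_dim, syntax_dim)

    def _modulate(self, v, gamma, beta, k):
        # Normalize block k, then scale and shift it with syntax-dependent values.
        H = self.H
        normed = layer_norm(v)
        return [gamma[k * H + j] * normed[j] + beta[k * H + j] for j in range(H)]

    def step(self, x, h, c, s):
        """x: video semantic input, h/c: previous states, s: encoded exemplar syntax."""
        H = self.H
        gamma = [1.0 + g for g in matvec(self.W_gamma, s)]  # scale centered at 1
        beta = matvec(self.W_beta, s)                       # shift centered at 0
        pre = matvec(self.W, x + h)                         # list concatenation [x; h]
        blocks = [self._modulate(pre[k * H:(k + 1) * H], gamma, beta, k)
                  for k in range(4)]
        i = [sigmoid(a) for a in blocks[0]]    # input gate
        f = [sigmoid(a) for a in blocks[1]]    # forget gate
        o = [sigmoid(a) for a in blocks[2]]    # output gate
        g = [math.tanh(a) for a in blocks[3]]  # candidate cell values
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        # Modulate the cell state as well before it is exposed through the output gate.
        c_mod = self._modulate(c_new, gamma, beta, 4)
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_mod)]
        return h_new, c_new


if __name__ == "__main__":
    cell = SyntaxModulatedLSTMCell(input_dim=8, hidden_dim=6, syntax_dim=4)
    video = [1.0] * 8
    h0, c0 = [0.0] * 6, [0.0] * 6
    # Same video input, two different (made-up) syntax encodings.
    h_a, _ = cell.step(video, h0, c0, [1.0, 0.0, 0.0, 0.0])
    h_b, _ = cell.step(video, h0, c0, [0.0, 0.0, 0.0, 1.0])
    print(h_a)
    print(h_b)
```

In this sketch the video input is held fixed while the syntax vector changes, so any difference between the two hidden states comes only from the modulation. That mirrors, in a very reduced form, the claim that different exemplar sentences steer the decoder toward different captions for the same video.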