Automatic video captioning aims to train models that generate text descriptions for all segments of a video; however, the most effective approaches require large amounts of manual annotation, which is slow and expensive to obtain. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we explore several active learning approaches for automatic video captioning and show that a cluster-regularized ensemble strategy is the most effective for efficiently gathering video captioning training sets. We evaluate our approaches on the MSR-VTT and LSMDC datasets using both transformer-based and LSTM-based captioning models, and show that our novel strategy achieves high performance while using up to 60% less training data than strong state-of-the-art baselines.
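To make the selection strategy concrete, the sketch below illustrates one plausible reading of cluster-regularized ensemble acquisition: score each unlabeled video by how much an ensemble of captioning models disagrees on its caption, then spread the labeling budget across clusters of video features so the batch stays diverse. This is a minimal illustration, not the paper's exact method; the function names, the token-level Jaccard disagreement measure, and the round-robin cluster regularizer are all assumptions for exposition.

```python
# Sketch of cluster-regularized ensemble selection for active learning.
# Assumptions: disagreement = mean pairwise Jaccard distance between the
# ensemble's captions; regularization = round-robin picking across KMeans
# clusters of video features. Neither detail is specified by the abstract.
from collections import defaultdict
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans


def caption_disagreement(captions: list[str]) -> float:
    """Mean pairwise token-level Jaccard distance among ensemble captions."""
    def jaccard_dist(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        union = ta | tb
        return 1.0 - len(ta & tb) / len(union) if union else 0.0

    pairs = list(combinations(captions, 2))
    return float(np.mean([jaccard_dist(a, b) for a, b in pairs])) if pairs else 0.0


def select_batch(video_feats: np.ndarray,
                 ensemble_captions: list[list[str]],
                 budget: int,
                 n_clusters: int = 10,
                 seed: int = 0) -> list[int]:
    """Pick `budget` unlabeled videos to annotate.

    Videos are ranked by ensemble disagreement, then drawn round-robin
    across feature clusters so no single region of the data dominates
    the batch (the cluster regularization).
    """
    scores = np.array([caption_disagreement(c) for c in ensemble_captions])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(video_feats)

    # Queue each cluster's members from most to least uncertain.
    per_cluster = defaultdict(list)
    for idx in np.argsort(-scores):
        per_cluster[labels[idx]].append(int(idx))

    selected: list[int] = []
    while len(selected) < budget and any(per_cluster.values()):
        for c in list(per_cluster):
            if per_cluster[c] and len(selected) < budget:
                selected.append(per_cluster[c].pop(0))
    return selected
```

Under this reading, pure uncertainty sampling is the special case `n_clusters = 1`; increasing `n_clusters` trades off peak disagreement for batch diversity, which is what lets the strategy avoid labeling many near-duplicate uninformative clips.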