Training video captioning models requires a large number of annotated video-caption pairs, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, producing high-entropy outputs even for less-informative samples. Moreover, video captioning models are multimodal in nature, with a visual encoder and a language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC, which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning for video captioning. Our approach integrates semantic similarity and uncertainty of both the visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning, improving on the baselines by a large margin.
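The intuition behind a semantics-aware acquisition function can be sketched as follows: raw decoder entropy is discounted when the sampled captions are semantically similar (since paraphrases inflate entropy without signaling informativeness), and the result is combined with the visual branch's uncertainty. This is a minimal illustrative sketch, not the paper's actual M-SASE formulation; the function name, the linear mixing weight `alpha`, and the use of mean pairwise similarity are all assumptions made for illustration.

```python
def semantic_aware_score(caption_entropies, pairwise_sims, visual_uncertainty, alpha=0.5):
    """Hypothetical acquisition score for one unlabeled video.

    caption_entropies:  per-caption sequence entropies from the language decoder
    pairwise_sims:      semantic similarities between sampled caption pairs (0..1)
    visual_uncertainty: an uncertainty estimate from the visual encoder
    alpha:              hypothetical weight balancing language vs. visual terms
    """
    mean_entropy = sum(caption_entropies) / len(caption_entropies)
    mean_sim = sum(pairwise_sims) / len(pairwise_sims) if pairwise_sims else 0.0
    # Paraphrase-heavy outputs (high similarity) are discounted: high raw
    # entropy among near-duplicate captions is not genuine uncertainty.
    language_term = mean_entropy * (1.0 - mean_sim)
    return alpha * language_term + (1.0 - alpha) * visual_uncertainty
```

Under this sketch, a video whose sampled captions are high-entropy paraphrases of each other scores lower than one whose captions are genuinely diverse, matching the motivation above.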