Video captioning is one of the most challenging problems at the intersection of vision and language, with many real-life applications in video retrieval, video surveillance, assistance for visually impaired people, human-machine interfaces, and more. Recent deep-learning-based methods have shown promising results, but their performance still lags behind that of other vision tasks (such as image classification and object detection). A significant drawback of existing video captioning methods is that they are optimized with the cross-entropy loss function, which is uncorrelated with the de facto evaluation metrics (BLEU, METEOR, CIDEr, ROUGE). In other words, cross-entropy is not a proper surrogate for the true loss function of video captioning. To mitigate this, methods such as REINFORCE, Actor-Critic, and Minimum Risk Training (MRT) have been applied, but they have limitations and are not very effective. This paper proposes an alternative solution by introducing a dynamic loss network (DLN) that provides an additional feedback signal directly reflecting the evaluation metrics. Our solution proves more efficient than existing alternatives and can be easily adapted to similar tasks. Our results on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.
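The abstract does not describe the DLN's architecture or training procedure, so the following PyTorch sketch is purely illustrative of the stated idea: it assumes the DLN is a small recurrent regressor trained to predict a sentence-level metric score (e.g., CIDEr) from the captioner's token distributions, whose prediction then serves as a differentiable, metric-aware term alongside cross-entropy. All names here (DynamicLossNetwork, combined_loss, alpha) are hypothetical.

```python
# A minimal, hypothetical sketch of a dynamic loss network (DLN): a small
# regressor fit to a sentence-level evaluation score so its prediction can act
# as a differentiable feedback signal next to cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLossNetwork(nn.Module):
    """Regresses an evaluation-metric score from predicted token probabilities."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(vocab_size, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, token_probs: torch.Tensor) -> torch.Tensor:
        # token_probs: (batch, seq_len, vocab_size), softmax output of the captioner
        _, h = self.encoder(token_probs)
        return self.head(h[-1]).squeeze(-1)  # (batch,) predicted metric score

def combined_loss(logits, targets, dln, metric_scores, alpha=0.5):
    """Cross-entropy plus a metric-aware feedback signal from the DLN.

    logits:        (batch, seq_len, vocab) captioner outputs
    targets:       (batch, seq_len) ground-truth token ids
    metric_scores: (batch,) true scores (e.g., CIDEr) used to fit the DLN
    """
    ce = F.cross_entropy(logits.transpose(1, 2), targets)
    probs = logits.softmax(dim=-1)
    # Fit the DLN to the real metric; detach so captioner gradients don't leak in.
    dln_fit = F.mse_loss(dln(probs.detach()), metric_scores)
    # Reward the captioner for captions the DLN scores highly. In practice the
    # captioner and DLN would use separate optimizers so that this term updates
    # only the captioner's parameters.
    feedback = -dln(probs).mean()
    return ce + alpha * feedback + dln_fit
```

Under these assumptions, the captioner still receives dense supervision from cross-entropy while the DLN term steers training toward the otherwise non-differentiable evaluation metrics; how the paper actually balances the two signals is not specified in the abstract.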