Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown strong performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT is still lacking. Furthermore, there is no comprehensive study of different strategies for video description generation with fully self-attentive networks, including how to exploit the accompanying audio. We therefore explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration that generalizes to unseen datasets and helps describe short video clips in natural language. Compared to a vanilla Transformer network, this configuration improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points, respectively, and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. Moreover, FPE increases the CIDEr score by a relative factor of 8.6%.
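To make the FPE idea concrete, the following is a minimal sketch of one plausible reading: sinusoidal positional encodings accept real-valued positions, so tokens of a faster-sampled modality (e.g., audio) can be assigned fractional positions on the time axis of the other modality (e.g., video), letting tokens from both streams that occur at the same instant receive nearly identical encodings. The helper names (`sinusoidal_pe`, `fractional_positions`) and the exact scaling rule are our assumptions for illustration, not the paper's implementation.

```python
import math
import torch

def sinusoidal_pe(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """Standard sinusoidal encoding evaluated at (possibly fractional) positions."""
    # positions: (seq_len,) float tensor of time indices
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    angles = positions.unsqueeze(1) * div.unsqueeze(0)  # (seq_len, d_model // 2)
    pe = torch.zeros(positions.size(0), d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def fractional_positions(n_tokens: int, n_reference: int) -> torch.Tensor:
    """Map n_tokens evenly onto the time axis of a reference stream
    with n_reference steps (assumed synchronization rule)."""
    return torch.arange(n_tokens, dtype=torch.float32) * (n_reference / n_tokens)

# Example: 32 video frames and 50 audio frames covering the same clip.
video_pe = sinusoidal_pe(torch.arange(32, dtype=torch.float32), d_model=512)
audio_pe = sinusoidal_pe(fractional_positions(50, 32), d_model=512)
# Audio token k now receives the encoding of fractional time k * 32/50,
# so audio and video tokens at the same instant get (nearly) matching PEs.
```

Under this reading, no interpolation of the features themselves is needed; only the positional indices are rescaled before the encoder consumes the concatenated audio-visual sequence.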