Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. To solve the task, a natural choice is to employ a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, and then use a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, resulting in visually irrelevant target captions; and 2) errors in the generated pivot captions are propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns the source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference that completely skips the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., the Multimodal Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE achieves 4% and 7% relative margins in CIDEr score over current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
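The difference between the two inference paths described above can be sketched as plain data flow. This is only an illustrative sketch, not the authors' implementation: every function name below is hypothetical, the components are toy stand-ins rather than trained models, and the abstract does not specify how the VIM performs the alignment.

```python
from typing import Callable, List

Features = List[float]
Tokens = List[str]


def pipeline_caption(
    video: List[List[float]],
    video_encoder: Callable[[List[List[float]]], Features],
    pivot_decoder: Callable[[Features], Tokens],
    text_encoder: Callable[[Tokens], Features],
    target_decoder: Callable[[Features], Tokens],
) -> Tokens:
    """Two-step baseline: the target decoder only ever sees pivot text."""
    visual = video_encoder(video)
    pivot_caption = pivot_decoder(visual)  # discrete bottleneck: errors here propagate
    pivot_states = text_encoder(pivot_caption)  # visual features are no longer available
    return target_decoder(pivot_states)


def injected_caption(
    video: List[List[float]],
    video_encoder: Callable[[List[List[float]]], Features],
    visual_injection: Callable[[Features], Features],
    target_decoder: Callable[[Features], Tokens],
) -> Tokens:
    """End-to-end path: visual features are aligned and fed straight to the
    target-language decoder, so no pivot caption is generated."""
    visual = video_encoder(video)
    aligned = visual_injection(visual)  # hypothetical stand-in for the VIM
    return target_decoder(aligned)


# Toy stand-ins (assumptions for illustration only).
def toy_video_encoder(frames: List[List[float]]) -> Features:
    return [sum(f) / len(f) for f in frames]  # one averaged value per frame


def toy_pivot_decoder(visual: Features) -> Tokens:
    return ["a", "man", "runs"] if visual[0] > 0.5 else ["a", "man", "walks"]


def toy_text_encoder(tokens: Tokens) -> Features:
    return [float(len(t)) for t in tokens]


def toy_visual_injection(visual: Features) -> Features:
    return [2.0 * v for v in visual]  # placeholder for a learned alignment


def toy_target_decoder(states: Features) -> Tokens:
    return [f"tok{int(round(s))}" for s in states]
```

In this sketch, `injected_caption` takes no pivot decoder at all, which mirrors the claim that inference skips pivot caption generation; in `pipeline_caption`, the target decoder receives only states derived from the pivot text.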