We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on learning syntactic grammars from text-image pairs, with promising results showing that the information from static images is useful for induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g., action, object, scene, audio, face, OCR, and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks, i.e., DiDeMo, YouCook2, and MSRVTT, confirming the effectiveness of leveraging video information for unsupervised grammar induction.
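To make the multi-modal aggregation step more concrete, the following is a minimal, illustrative sketch (not the authors' released implementation) of how pre-extracted per-modality video features could be projected into a shared space and combined with learned attention weights before being passed to a grammar-induction model. All names (ModalityAggregator, d_model) and dimensions are assumptions introduced purely for illustration.

```python
# Illustrative sketch of multi-modality feature aggregation; dimensions and
# module names are hypothetical and not taken from the paper's code.
import torch
import torch.nn as nn


class ModalityAggregator(nn.Module):
    def __init__(self, modality_dims, d_model=512):
        super().__init__()
        # One linear projection per modality (e.g., action, object, scene, audio, ...).
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in modality_dims)
        # Scores each projected modality feature for a softmax-weighted sum.
        self.attn = nn.Linear(d_model, 1)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each of shape (batch, dim_m)
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projs, modality_feats)], dim=1
        )  # (batch, num_modalities, d_model)
        weights = torch.softmax(self.attn(projected), dim=1)  # (batch, M, 1)
        return (weights * projected).sum(dim=1)  # (batch, d_model)


if __name__ == "__main__":
    # Toy example: three modalities with different feature sizes.
    feats = [torch.randn(2, 2048), torch.randn(2, 1024), torch.randn(2, 128)]
    agg = ModalityAggregator([2048, 1024, 128])
    video_vec = agg(feats)
    print(video_vec.shape)  # torch.Size([2, 512])
```

In such a setup, the aggregated video vector would condition the parser's span scores alongside the text, enabling end-to-end training of the whole pipeline.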