Video-aided grammar induction aims to leverage video information to find more accurate syntactic grammars for accompanying text. While previous work focuses on building systems that induce grammars on text well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. We build a new model that can better learn video-span correlation without the manually designed features adopted by previous work. Experiments show that our model, trained only on large-scale YouTube data with no text-video alignment, achieves strong and robust performance across three unseen datasets, despite domain shift and noisy-label issues. Furthermore, our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.