Understanding realistic visual scene images together with language descriptions is a fundamental task toward generic visual understanding. Prior work has shown compelling results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural language (e.g., dependency trees) separately. However, how to construct a joint vision-language (VL) structure has barely been investigated. More challenging but worthwhile, we introduce a new task that aims to induce such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than relying on labor-intensive annotation from scratch, we propose an automatic alignment procedure that produces coarse structures, followed by human refinement that yields high-quality ones. Moreover, we benchmark our dataset with a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model achieves superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships for fine-grained VL structure construction.
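For readers unfamiliar with contrastive alignment between visual and linguistic representations, the sketch below illustrates the general idea behind a CL-based VL objective: pooled image-region (scene-graph) features and pooled dependency-node features from the same image-caption pair are pulled together, while other pairs in the batch act as negatives. This is a generic InfoNCE-style objective given as a minimal illustration under stated assumptions; the function name, feature shapes, and temperature value are hypothetical and do not reproduce the authors' VLGAE implementation.

```python
# Minimal, hypothetical sketch of a contrastive image-caption alignment loss.
# All names and shapes are illustrative assumptions, not the VLGAE code.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(region_feats: torch.Tensor,
                               node_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of pooled region/node embeddings.

    region_feats: (B, D) pooled visual scene-graph features, one row per image.
    node_feats:   (B, D) pooled dependency-tree node features, one row per caption.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    node_feats = F.normalize(node_feats, dim=-1)
    # (B, B) similarity matrix; the diagonal holds matching image-caption pairs.
    logits = region_feats @ node_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match images to captions and captions to images.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Example usage with random features for a batch of 4 image-caption pairs.
    regions = torch.randn(4, 256)
    nodes = torch.randn(4, 256)
    print(contrastive_alignment_loss(regions, nodes).item())
```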