Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we further advance this line of work by presenting Visual Spatial Description (VSD), a new perspective on image-to-text generation oriented toward spatial semantics. Given an image and two objects inside it, VSD aims to produce a description focusing on the spatial relationship between the two objects. Accordingly, we manually annotate a dataset to facilitate investigation of the newly introduced task, and we build several benchmark encoder-decoder models using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our models. Finally, we conduct experiments on our benchmark dataset to evaluate all of our models. The results show that our models generate accurate, human-like, spatially oriented text descriptions, that VSRC holds great potential for VSD, and that the joint end-to-end architecture is the better choice for their integration. We make the dataset and code publicly available for research purposes.
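To make the joint end-to-end idea concrete, the following is a minimal, illustrative sketch (not the authors' released code) of a single multimodal encoder-decoder that both classifies the visual spatial relation (VSRC) and generates the spatial description (VSD). The module names, feature dimensions, and label set size here are hypothetical stand-ins for a pretrained VL-BART / VL-T5 backbone and the annotated dataset.

```python
# Hedged sketch of a joint VSRC + VSD model; all component names are hypothetical.
import torch
import torch.nn as nn


class JointVSDModel(nn.Module):
    def __init__(self, hidden_dim=768, num_relations=9, vocab_size=32100):
        super().__init__()
        # Stand-ins for a pretrained vision-language encoder-decoder backbone.
        self.region_proj = nn.Linear(2048, hidden_dim)      # project image region features
        self.token_embed = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True), num_layers=2)
        self.vsrc_head = nn.Linear(hidden_dim, num_relations)  # spatial-relation classifier
        self.lm_head = nn.Linear(hidden_dim, vocab_size)        # description generator

    def forward(self, region_feats, prompt_ids, target_ids):
        # Encode image regions (including the two query objects) together with a
        # textual prompt that names the object pair.
        x = torch.cat([self.region_proj(region_feats), self.token_embed(prompt_ids)], dim=1)
        memory = self.encoder(x)
        # VSRC branch: classify the spatial relation from the pooled encoder states.
        vsrc_logits = self.vsrc_head(memory.mean(dim=1))
        # VSD branch: decode the spatial description conditioned on the same states.
        dec = self.decoder(self.token_embed(target_ids), memory)
        lm_logits = self.lm_head(dec)
        return vsrc_logits, lm_logits


# Joint training combines both objectives, so VSRC information flows into generation.
model = JointVSDModel()
regions = torch.randn(2, 36, 2048)           # 36 region features per image (hypothetical)
prompt = torch.randint(0, 32100, (2, 12))    # e.g. a prompt naming the two objects
target = torch.randint(0, 32100, (2, 20))    # gold spatial description tokens
relation = torch.randint(0, 9, (2,))         # gold VSRC labels
vsrc_logits, lm_logits = model(regions, prompt, target)
loss = nn.CrossEntropyLoss()(vsrc_logits, relation) + \
       nn.CrossEntropyLoss()(lm_logits.reshape(-1, 32100), target.reshape(-1))
loss.backward()
```

In the pipeline alternative mentioned above, the VSRC classifier would instead be run first and its predicted relation fed to the generator as extra input, rather than sharing one encoder and a joint loss as in this sketch.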