We analyze the grounded SCAN (gSCAN) benchmark, which was recently proposed to study systematic generalization for grounded language understanding. First, we study which aspects of the original benchmark can be solved by commonly used methods in multi-modal research. We find that a general-purpose Transformer-based model with cross-modal attention achieves strong performance on a majority of the gSCAN splits, surprisingly outperforming more specialized approaches from prior work. Furthermore, our analysis suggests that many of the remaining errors reveal the same fundamental challenge in systematic generalization of linguistic constructs regardless of visual context. Second, inspired by this finding, we propose challenging new tasks for gSCAN by generating data to incorporate relations between objects in the visual environment. Finally, we find that current models are surprisingly data inefficient given the narrow scope of commands in gSCAN, suggesting another challenge for future work.