Humans can reason compositionally while grounding language utterances to the real world. Recent benchmarks like ReaSCAN use navigation tasks grounded in a grid world to assess whether neural models exhibit similar capabilities. In this work, we present a simple transformer-based model that outperforms specialized architectures on ReaSCAN and a modified version of gSCAN. Analyzing the task, we find that identifying the target location in the grid world is the main challenge for the models. Furthermore, we show that a particular split in ReaSCAN, which tests depth generalization, is unfair. On an amended version of this split, we show that transformers can generalize to deeper input structures. Finally, we design RefEx, a simpler grounded compositional generalization task, to investigate how transformers reason compositionally. We show that a single self-attention layer with a single head generalizes to novel combinations of object attributes. Moreover, we derive a precise mathematical construction of the transformer's computations from the learned network. Overall, we provide valuable insights into the grounded compositional generalization task and the behaviour of transformers on it, which should be useful to researchers working in this area.