Deep learning models generalize well to in-distribution data but struggle to generalize compositionally, i.e., to combine a set of learned primitives to solve more complex tasks. In sequence-to-sequence (seq2seq) learning, transformers are often unable to predict correct outputs for examples longer than those seen during training. This paper introduces iterative decoding, an alternative to seq2seq that (i) improves transformer compositional generalization on the PCFG and Cartesian product datasets and (ii) provides evidence that, on these datasets, seq2seq transformers do not learn iterations that are not unrolled. In iterative decoding, training examples are broken down into a sequence of intermediate steps that the transformer learns iteratively. At inference time, the intermediate outputs are fed back to the transformer as intermediate inputs until an end-of-iteration token is predicted. We conclude by illustrating some limitations of iterative decoding on the CFQ dataset.
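To make the inference procedure concrete, here is a minimal sketch of the feedback loop described above, assuming a hypothetical `model` callable, an `END_OF_ITERATION` marker token, and an iteration cap; the paper's actual tokenization and stopping criterion may differ.

```python
# Minimal sketch of iterative decoding at inference time (names are illustrative).

END_OF_ITERATION = "<eoi>"   # assumed end-of-iteration marker token
MAX_ITERATIONS = 20          # safety cap so a malformed prediction cannot loop forever


def iterative_decode(model, input_tokens):
    """Repeatedly feed the model's intermediate output back in as the next input
    until the end-of-iteration token is predicted (or the cap is reached)."""
    current = list(input_tokens)
    for _ in range(MAX_ITERATIONS):
        output = model(current)                 # one ordinary seq2seq pass
        if output and output[-1] == END_OF_ITERATION:
            return output[:-1]                  # final answer, marker stripped
        current = output                        # intermediate output -> next input
    return current                              # give up after the cap


# Toy stand-in "model": performs one bubble-sort swap per call, then signals done.
def toy_model(tokens):
    for i in range(len(tokens) - 1):
        if tokens[i] > tokens[i + 1]:
            return tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
    return tokens + [END_OF_ITERATION]


if __name__ == "__main__":
    print(iterative_decode(toy_model, ["c", "b", "a"]))  # -> ['a', 'b', 'c']
```

The toy model stands in for a trained transformer: each call resolves one intermediate step, mirroring how training examples are decomposed into a sequence of steps that the decoder iterates over.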