Inspired by humans' exceptional ability to master arithmetic and generalize to new problems, we present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability to learn generalizable concepts at three levels: perception, syntax, and semantics. In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images (i.e., perception), how multiple concepts are structurally combined to form a valid expression (i.e., syntax), and how concepts are realized to afford various reasoning tasks (i.e., semantics), all in a weakly supervised manner. Focusing on systematic generalization, we carefully design a five-fold test set to evaluate both the interpolation and the extrapolation of learned concepts w.r.t. the three levels. Further, we design a few-shot learning split to determine whether models can rapidly learn new concepts and generalize them to more complex scenarios. To probe the limitations of existing models, we undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3 (with chain-of-thought prompting). The results indicate that current models struggle to extrapolate to long-range syntactic dependencies and semantics. Models exhibit a considerable gap from human-level generalization when evaluated on new concepts in a few-shot setting. Moreover, we find that HINT cannot be solved by merely scaling up the dataset and the model size; this strategy contributes little to the extrapolation of syntax and semantics. Finally, in zero-shot GPT-3 experiments, chain-of-thought prompting exhibits impressive results and significantly boosts test accuracy. We believe the HINT dataset and the experimental findings are of great interest to the learning community studying systematic generalization.