The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated are not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP, we use that taxonomy to present a comprehensive map of published generalisation studies, and we make recommendations for which areas might deserve attention in the future. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they aim to solve, the type of data shift they consider, the source by which this data shift is obtained, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 previous papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis of the current state of generalisation research in NLP, and make recommendations for the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to take steps towards making state-of-the-art generalisation testing the new status quo in NLP.