When a neural language model (LM) is adapted to perform a new task, what aspects of the task predict the eventual performance of the model? In NLP, systematic features of LM generalization to individual examples are well characterized, but systematic aspects of LM adaptability to new tasks are not nearly as well understood. We present a large-scale empirical study of the features and limits of LM adaptability using a new benchmark, TaskBench500, built from 500 procedurally generated sequence modeling tasks. These tasks combine core aspects of language processing, including lexical semantics, sequence processing, memorization, logical reasoning, and world knowledge. Using TaskBench500, we evaluate three facets of adaptability, finding that: (1) adaptation procedures differ dramatically in their ability to memorize small datasets; (2) within a subset of task types, adaptation procedures exhibit compositional adaptability to complex tasks; and (3) failure to match training label distributions is explained by mismatches in the intrinsic difficulty of predicting individual labels. Our experiments show that adaptability to new tasks, like generalization to new examples, can be systematically described and understood, and we conclude with a discussion of additional aspects of adaptability that could be studied using the new benchmark.
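To make the idea of procedurally generated, compositional tasks concrete, here is a minimal illustrative sketch in Python. It assumes word-level "atomic" tasks (e.g., an antonym lookup) combined by sequence-level operators; all function names, the toy antonym table, and the specific operators are hypothetical illustrations, not the benchmark's actual task definitions or API.

```python
# Hypothetical sketch of compositional task construction in the spirit of
# TaskBench500. Atom names, the toy lexicon, and the operators below are
# illustrative assumptions, not the benchmark's actual implementation.

from typing import Callable, List

# A toy lexicon standing in for a lexical-semantics resource.
ANTONYMS = {"hot": "cold", "big": "small", "up": "down"}

def antonym(word: str) -> str:
    """Word-level atom (lexical semantics): map a word to its antonym."""
    return ANTONYMS[word]

def duplicate(word: str) -> str:
    """Word-level atom (sequence processing): repeat a token."""
    return word + " " + word

def chain(f: Callable[[str], str], g: Callable[[str], str]) -> Callable[[str], str]:
    """Compose two word-level tasks into a more complex word-level task."""
    return lambda w: g(f(w))

def map_task(f: Callable[[str], str]) -> Callable[[List[str]], List[str]]:
    """Lift a word-level task to a sequence-level task by applying it tokenwise."""
    return lambda seq: [f(w) for w in seq]

# A "complex" task built compositionally from atoms:
# for each input token, output its antonym, duplicated.
task = map_task(chain(antonym, duplicate))

print(task(["hot", "up"]))  # ['cold cold', 'down down']
```

Under this kind of scheme, enumerating atoms and composition operators yields a large space of tasks whose structure is known by construction, which is what makes it possible to ask whether adaptation procedures that succeed on the atoms also succeed on their compositions.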