Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.