Generative AI models have shown impressive performance on many Natural Language Processing tasks, such as language understanding, reasoning, and language generation. One of the most important questions the AI community is asking today concerns the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies on generative Large Language Models (LLMs) are restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages. We present MEGA, the first comprehensive benchmarking of generative LLMs, which evaluates models on standard NLP benchmarks covering 8 diverse tasks and 33 typologically diverse languages. We also compare the performance of generative LLMs to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform relative to the previous generation of LLMs. We present a thorough analysis of model performance across languages and discuss some of the reasons why generative LLMs are currently not optimal for all languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.