Large Language Models (LLMs) have been reported to achieve strong performance on natural language processing tasks. However, performance metrics such as accuracy do not capture whether a model robustly represents complex linguistic structure. In this paper, focusing on the ability of language models to represent syntax, we propose a framework for assessing the consistency and robustness of linguistic representations. To this end, we introduce measures of robustness for neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks designed to extract meaningful information about a single facet of a language model, such as syntax reconstruction and root identification. Empirically, we evaluate four LLMs across six corpora on the proposed robustness measures, analysing how their syntactic representations behave under syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models, and logs available to the community as a contribution to the debate about the capabilities of LLMs.
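As an informal illustration of the kind of robustness measure described above (a minimal sketch, not the exact formulation evaluated in the paper), the snippet below probes a representation for syntactic structure on a sentence and on a syntax-preserving perturbation of it, then compares the recovered structures. The embeddings are random placeholders standing in for LLM outputs, and `probe_B` is an assumed, already-trained structural-probe projection in the style of Hewitt and Manning (2019); both are hypothetical names introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(vectors: np.ndarray) -> np.ndarray:
    """Squared L2 distances between all pairs of (projected) word vectors."""
    diff = vectors[:, None, :] - vectors[None, :, :]
    return (diff ** 2).sum(-1)

def probed_tree_distances(embeddings: np.ndarray, probe_B: np.ndarray) -> np.ndarray:
    """Structural-probe-style estimate of parse-tree distances between tokens."""
    return pairwise_distances(embeddings @ probe_B)

# Placeholder data: a 7-token sentence and a syntax-preserving perturbation of
# the same length, both "embedded" into 768 dimensions. In the actual setup
# these would come from an LLM such as BERT, or from GloVe lookups.
emb_original  = rng.normal(size=(7, 768))
emb_perturbed = rng.normal(size=(7, 768))
probe_B = rng.normal(size=(768, 64))   # assumed pre-trained probe parameters

d_orig = probed_tree_distances(emb_original, probe_B)
d_pert = probed_tree_distances(emb_perturbed, probe_B)

# One possible robustness score: how much the probed syntactic geometry shifts
# under a perturbation that should leave the syntax unchanged
# (0 = perfectly stable, larger = more brittle).
robustness_gap = np.abs(d_orig - d_pert).mean()
print(f"mean shift in probed tree distances: {robustness_gap:.3f}")
```

The design choice here is to score robustness on the probe's output (the recovered syntactic geometry) rather than on raw embeddings, so that only syntax-relevant changes in the representation are penalised.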