Large Language Models (LLMs) have been reported to achieve strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of a model in terms of its ability to robustly represent complex linguistic structure. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce measures of robustness of neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks used to extract meaningful information about a single facet of a language model, such as syntax reconstruction and root identification. Empirically, we evaluate four LLMs across six different corpora on the proposed robustness measures, analysing their performance and robustness with respect to syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models, and logs available to the community as a contribution to the debate about the capabilities of LLMs.