Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail to effectively capture the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, a single non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample-efficient method of learning representations of sentences containing MWEs.