Deep neural models, in particular Transformer-based pre-trained language models, require significant amounts of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample-efficient methods of idiomaticity detection. In particular, we study the impact of Pattern-Exploiting Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings, on the task of idiomaticity detection. In addition, to further explore generalisability, we focus on the identification of MWEs not present in the training data. Our experiments show that while these methods improve performance on English, they are much less effective on Portuguese and Galician, leading to overall performance roughly on par with vanilla mBERT. Regardless, we believe sample-efficient methods for both identifying and representing potentially idiomatic MWEs are very encouraging and hold significant potential for future exploration.