Automatic term extraction plays an essential role in domain-specific language understanding and in several downstream natural language processing tasks. In this paper, we present a comparative study of the predictive power of Transformer-based pretrained language models for term extraction in a multi-language, cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by taking the intersection or union of the term sets output by different language models. Our experiments have been conducted on the ACTER corpus, covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus, covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms state-of-the-art approaches from related work that leverage multilingual models, for all languages except Dutch and French when the term extraction task excludes named entity extraction. Furthermore, by combining the outputs of the two best-performing models, we achieve significant improvements.
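As a rough illustration of the set-based ensembling strategy described above (a minimal sketch, not the authors' implementation), the following Python snippet combines the predicted term sets of two models by intersection or union; the function name and example terms are hypothetical.

```python
def ensemble_terms(terms_a: set[str], terms_b: set[str], mode: str = "union") -> set[str]:
    """Combine the term output sets of two extraction models.

    Intersection keeps only terms both models agree on (favoring precision);
    union keeps terms found by either model (favoring recall).
    """
    if mode == "intersection":
        return terms_a & terms_b
    return terms_a | terms_b

# Hypothetical outputs from a monolingual and a multilingual model:
mono = {"wind turbine", "rotor blade", "nacelle"}
multi = {"wind turbine", "gearbox", "nacelle"}
print(ensemble_terms(mono, multi, "intersection"))  # {'wind turbine', 'nacelle'}
print(ensemble_terms(mono, multi, "union"))         # all five terms
```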