Lemmatization is a Natural Language Processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. It is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information when training contextual lemmatizers has become common practice, without analyzing whether doing so is optimal in terms of downstream performance. Thus, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common application use. The results of our study are rather surprising: (i) providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain good contextual lemmatizers without seeing any explicit morphological signal; (iii) the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.