Lemmatisation is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence. As a result, developing an efficient lemmatisation algorithm is a complex task. In recent years, deep learning models have outperformed other methods for this task, including other machine learning algorithms. This paper presents a Polish lemmatiser based on Google's T5 model. The model was trained with several different context lengths. It achieves the best results for Polish lemmatisation.
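The contrast with stemming drawn above can be made concrete with a small sketch. It is a toy illustration only and not the T5-based approach of the paper: the lexicon, the suffix list, and the function names `naive_stem` and `lemmatize` are hypothetical, and English words are used purely for readability.

```python
# Toy illustration only: the lexicon, suffix list and function names are
# hypothetical and are not taken from the paper.

# (surface form, part of speech) -> lemma
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
}

SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring part of speech and context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a few characters to remain so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


if __name__ == "__main__":
    for word, pos in [("saw", "VERB"), ("saw", "NOUN"), ("meeting", "NOUN")]:
        print(word, pos, "stem:", naive_stem(word), "lemma:", lemmatize(word, pos))
```

The stemmer gives the same answer for a surface form whatever its role in the sentence, whereas the lookup returns different lemmas for "saw" as a verb and as a noun. A learned model has to infer that distinction from the surrounding words, which is why the amount of context supplied matters.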