Stemming is the process of reducing related words to a standard form by removing affixes from them. Existing algorithms vary with respect to their complexity, configurability, handling of unknown words, and ability to avoid under- and over-stemming. This paper presents a fast, simple, configurable, high-precision, high-recall stemming algorithm that combines the simplicity and performance of word-based lookup tables with the strong generalizability of rule-based methods to avert problems with out-of-vocabulary words.
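To make the general idea of combining a word-based lookup table with rule-based suffix stripping concrete, the following is a minimal sketch and not the algorithm presented in this paper. The table entries, the suffix rules, the minimum stem length, and the names `LOOKUP`, `SUFFIX_RULES`, `MIN_STEM_LEN`, and `stem` are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical whole-word lookup table (illustrative entries only).
# It handles irregular forms that suffix rules would get wrong.
LOOKUP: Dict[str, str] = {
    "ran": "run",
    "mice": "mouse",
    "better": "good",
}

# Hypothetical suffix rules as (suffix, replacement), tried in order,
# so longer suffixes are listed before shorter ones.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

# Assumed guard against over-stemming: never strip a suffix if fewer
# than this many characters would remain before the replacement.
MIN_STEM_LEN = 3


def stem(
    word: str,
    lookup: Dict[str, str] = LOOKUP,
    rules: List[Tuple[str, str]] = SUFFIX_RULES,
) -> str:
    """Return a stem: table lookup first, suffix rules as a fallback."""
    w = word.lower()

    # 1. Known words are resolved directly from the table.
    if w in lookup:
        return lookup[w]

    # 2. Out-of-vocabulary words fall back to the first matching rule.
    for suffix, replacement in rules:
        remaining = len(w) - len(suffix)
        if w.endswith(suffix) and remaining >= MIN_STEM_LEN:
            return w[:remaining] + replacement

    # 3. No rule applies: the word is returned unchanged.
    return w


if __name__ == "__main__":
    for token in ["mice", "ponies", "walking", "jumped", "cats", "bus"]:
        print(token, "->", stem(token))
```

In this sketch the table and the rule list are plain data passed as parameters, so either can be swapped out without touching the function; that is one simple way to obtain configurability, under the assumptions stated above.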