Stop words play an important role in information retrieval and text analysis tasks in natural language processing. This work presents a method for evaluating the quality of stop word lists, aimed at lists produced by automatic creation techniques. Although the proposed method was tested on an automatically generated list of stop words for the Uzbek language, it can be applied, with some modifications, to similar languages, either from the same family or with an agglutinative nature. Because Uzbek is an agglutinative language, automatic detection of stop words in it is a more complex process than in inflected languages. We also build on our previous work on stop word detection, which used the "School corpus" as an example, by investigating how to automatically analyse the detection of stop words in Uzbek texts. This work addresses two questions: whether there is a good way to evaluate the available stop words for Uzbek texts, and whether the part of an Uzbek sentence that contains the majority of the stop words can be determined by studying the numerical characteristics of the probability of unique words. The results show acceptable accuracy for the stop word lists.