Measuring the similarity of text documents has received considerable attention in areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP), and Computational Linguistics. Converting text into numeric vectors is a complex task that involves steps such as tokenization, stopword filtering, stemming, and term weighting. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for retrieving relevant documents. Many extensions of TF-IDF have been proposed to improve term weighting. This paper proposes a further extension of TF-IDF that takes synonyms into account. The effectiveness of the method is confirmed by experiments that use measures such as Cosine, Dice, and Jaccard to compute the similarity of text documents in the Kazakh language.
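To make the idea concrete, the following is a minimal sketch of one way synonyms could be folded into TF-IDF before computing Cosine, Dice, and Jaccard similarity. It is not the paper's actual formulation, which the abstract does not specify. The assumptions here are: documents arrive already tokenized, stopword-filtered, and stemmed; synonyms are handled by mapping each term to a canonical form through a hypothetical `synonyms` dictionary; the IDF uses a smoothed variant, `log((1 + N) / (1 + df)) + 1`; and Dice and Jaccard are taken in their common weighted-vector forms. All function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs, synonyms=None):
    """Return one sparse TF-IDF vector (dict) per tokenized document.

    `synonyms` is a hypothetical mapping from a term to its canonical
    synonym; mapped terms are merged before any counting (an assumption,
    not the paper's method).
    """
    synonyms = synonyms or {}
    docs = [[synonyms.get(t, t) for t in doc] for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times smoothed IDF (assumed variant).
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def _dot(a, b):
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    # Weighted (Tanimoto) form of the Jaccard coefficient.
    denom = _dot(a, a) + _dot(b, b) - _dot(a, b)
    return _dot(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy English tokens stand in for preprocessed Kazakh text.
    docs = [["fast", "car", "red"], ["quick", "car", "blue"]]
    plain = tfidf_vectors(docs)
    merged = tfidf_vectors(docs, synonyms={"quick": "fast"})
    print("cosine without synonyms:", cosine(plain[0], plain[1]))
    print("cosine with synonyms:   ", cosine(merged[0], merged[1]))
```

In this toy setup, merging "quick" into "fast" gives the two documents one more shared term, so all three similarity scores rise relative to plain TF-IDF. A real system would need a synonym resource for Kazakh and a decision about how to weight partial synonymy, neither of which this sketch addresses.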