Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document relevant if the queried words occur near each other in the document. The proximity factor is even more significant when the query consists of frequently occurring words. Proximity full-text search requires storing information about every occurrence, in every document, of every word that the user can search for. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at a distance from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of high-frequency words. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema based on an analysis of the experimental results.

This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: http://ceur-ws.org/Vol-2790/
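To make the idea of storing nearby words concrete, the following is a minimal sketch, not the index schema of the paper. It assumes whitespace tokenization and a simplified two-word key; the names `build_nearby_index`, `docs`, and `max_distance` are hypothetical and stand in for the MaxDistance parameter described above.

```python
from collections import defaultdict


def build_nearby_index(docs, max_distance):
    """Map each (word, nearby_word) key to a list of postings.

    A posting is (doc_id, position_of_word, position_of_nearby_word).
    Two occurrences are "nearby" when their positions differ by at most
    max_distance. This is an illustrative simplification, not the
    multi-component key index evaluated in the paper.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()  # assumption: naive whitespace tokenizer
        for pos, word in enumerate(words):
            lo = max(0, pos - max_distance)
            hi = min(len(words), pos + max_distance + 1)
            for other_pos in range(lo, hi):
                if other_pos != pos:
                    key = (word, words[other_pos])
                    index[key].append((doc_id, pos, other_pos))
    return index


docs = ["to be or not to be", "who are you and who is he"]
index = build_nearby_index(docs, max_distance=2)

# Postings for the pair ("to", "be"): both words are frequent, yet the
# pair key yields only the occurrences that are close to each other.
print(index[("to", "be")])
```

In this sketch a larger `max_distance` lets more pairs be answered from the index, at the cost of more postings per occurrence; this is the kind of trade-off between index size, performance, and search quality that the abstract refers to.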
