Spatial objects often come with textual information, such as Points of Interest (POIs) with their descriptions, which are referred to as geo-textual data. To retrieve such data, spatial keyword queries that take into account both spatial proximity and textual relevance have been extensively studied. Existing indexes designed for spatial keyword queries are mostly built based on the geo-textual data without considering the distribution of queries already received. However, previous studies have shown that utilizing the known query distribution can improve the index structure for future query processing. In this paper, we propose WISK, a learned index for spatial keyword queries, which self-adapts for optimizing querying costs given a query workload. One key challenge is how to utilize both structured spatial attributes and unstructured textual information during learning the index. We first divide the data objects into partitions, aiming to minimize the processing costs of the given query workload. We prove the NP-hardness of the partitioning problem and propose a machine learning model to find the optimal partitions. Then, to achieve more pruning power, we build a hierarchical structure based on the generated partitions in a bottom-up manner with a reinforcement learning-based approach. We conduct extensive experiments on real-world datasets and query workloads with various distributions, and the results show that WISK outperforms all competitors, achieving up to 8x speedup in querying time with comparable storage overhead.
翻译:空间对象通常带有文本信息,例如带有描述的兴趣点(POI),这些称为地理文本数据。为了检索这样的数据,已经广泛研究了考虑空间接近度和文本相关性的空间关键词查询。为空间关键词查询设计的现有索引主要是基于地理文本数据构建的,而没有考虑已经接收到的查询分布。然而,以前的研究表明,利用已知的查询分布可以提高未来查询处理的索引结构。在本文中,我们提出了一种面向工作负载的空间关键词查询学习索引WISK,它自适应于优化给定查询工作负载的查询成本。一个关键的挑战是如何在学习索引时利用结构化的空间属性和非结构化的文本信息。我们首先将数据对象分成分区,旨在最小化给定查询工作负载的处理成本。我们证明了分区问题的NP难度,并提出了一种机器学习模型来找到最优的分区。然后,为了获得更多的剪枝能力,我们使用基于强化学习的方法从底层向上建立了一个基于生成分区的层次结构。我们在真实世界的数据集和不同分布的查询工作负载上进行了大量实验,结果显示WISK优于所有竞争对手,在查询时间上获得了高达8倍的加速,存储开销可比。