Indexing intervals is a fundamental problem, finding a wide range of applications. Recent work on managing large collections of intervals in main memory focused on overlap joins and temporal aggregation problems. In this paper, we propose novel and efficient in-memory indexing techniques for intervals, with a focus on interval range queries, which are a basic component of many search and analysis tasks. First, we propose an optimized version of a single-level (flat) domain-partitioning approach, which may have large space requirements due to excessive replication. Then, we propose a hierarchical partitioning approach, which assigns each interval to at most two partitions per level and has controlled space requirements. Novel elements of our techniques include the division of the intervals at each partition into groups based on whether they begin inside or before the partition boundaries, reducing the information stored at each partition to the absolutely necessary, and the effective handling of data sparsity and skew. Experimental results on real and synthetic interval sets of different characteristics show that our approaches are typically one order of magnitude faster than the state-of-the-art.
翻译:索引间隔是一个根本问题,它寻求广泛的应用。最近关于管理大量主要记忆间距收集的工作侧重于重叠和时间聚集问题。在本文件中,我们提出了新颖和高效的间隔模拟索引技术,重点是间隔查询,这是许多搜索和分析任务的一个基本组成部分。首先,我们提出了单级(平缩)域分割法的优化版本,由于过度复制,可能具有很大的空间要求。然后,我们提出了等级分隔法,将每个间隔分配到每个级别最多两个分区,并有控制的空间要求。我们技术的创略要素包括根据每个分区的间隔从分区边界内开始还是从分区边界开始,分成若干组,将每个分区储存的信息减少到绝对必要的,以及有效地处理数据宽度和偏差。不同特征的实合成间隔组合的实验结果显示,我们的方法通常比状态快一个数量级。