We propose Hercules, a parallel tree-based technique for exact similarity search on massive disk-based data series collections. We present novel index construction and query answering algorithms that leverage different summarization techniques, carefully schedule costly operations, optimize memory and disk accesses, and exploit the multi-threading and SIMD capabilities of modern hardware to perform CPU-intensive calculations. We demonstrate the superiority and robustness of Hercules with an extensive experimental evaluation against state-of-the-art techniques, using many synthetic and real datasets and query workloads of varying difficulty. The results show that Hercules is up to one order of magnitude faster than the best competitor (which is not always the same technique). Moreover, Hercules is the only index that outperforms the optimized scan in all scenarios, including the hard query workloads on disk-based datasets. This paper was published in the Proceedings of the VLDB Endowment, Volume 15, Number 10, June 2022.