Data series similarity search is a core operation for data series analysis applications across many domains. However, state-of-the-art techniques fail to deliver the time performance required for interactive exploration or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index exploits the parallelization opportunities of modern hardware (i.e., SIMD instructions, multi-socket and multi-core architectures) to accelerate both index construction and similarity search. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, which maximizes its performance for in-memory operations. MESSI supports similarity search using both the Euclidean and Dynamic Time Warping (DTW) distances. Our experiments with synthetic and real datasets demonstrate that, overall, MESSI is up to 4x faster at index construction and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100 GB datasets in ~50 ms (30-75 ms across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.
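To make the query type concrete, the following is a minimal sketch of exact nearest-neighbor search under the two distances named above. It is a sequential brute-force baseline, not MESSI's index or its parallel algorithm; the function names (`euclidean`, `dtw`, `exact_nn`) and the optional `window` warping constraint are illustrative assumptions and do not come from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b, window=None):
    """DTW distance with squared point costs and an optional warping window.

    `window` (an assumed parameter) limits how far the alignment may stray
    from the diagonal; None means unconstrained.
    """
    n, m = len(a), len(b)
    w = max(window if window is not None else max(n, m), abs(n - m))
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return math.sqrt(prev[m])

def exact_nn(query, collection, dist=euclidean):
    """Scan every series and return (index, distance) of the exact nearest one."""
    best_idx, best_dist = -1, float("inf")
    for i, series in enumerate(collection):
        d = dist(query, series)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist
```

Calling `exact_nn(query, collection)` answers a Euclidean query, and passing `dist=dtw` switches to DTW. A full scan of this kind touches every series for every query, which is the cost that an index with pruning and parallel workers aims to avoid.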