Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results, before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate ProS, a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We develop our method for k-NN queries and demonstrate how it can be applied with the two most popular distance measures, namely, Euclidean and Dynamic Time Warping (DTW). We provide both initial and progressive estimates of the final answer that improve as the similarity search progresses, as well as suitable stopping criteria for the progressive queries. Moreover, we describe how this method can be used in order to develop a progressive algorithm for data series classification (based on a k-NN classifier), and we additionally propose a method designed specifically for the classification task. Experiments with several diverse synthetic and real datasets demonstrate that our prediction methods constitute the first practical solutions to the problem, significantly outperforming competing approaches. This paper was published in the VLDB Journal (2022).
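To make the abstract's terminology concrete, the sketch below illustrates the two distance measures it mentions (Euclidean and DTW) together with a naive exact k-NN search over a data series collection. This is background only, not the ProS method itself: the function names, the quadratic DTW without a warping window, and the brute-force search are assumptions made for this illustration.

```python
# Illustrative sketch: Euclidean and DTW distances for data series,
# plus a brute-force exact k-NN search. NOT the ProS algorithm.

import heapq
import numpy as np


def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between two data series of equal length."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


def dtw(x: np.ndarray, y: np.ndarray) -> float:
    """Classic O(n*m) Dynamic Time Warping distance (no warping window)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))


def knn_search(query: np.ndarray, collection: np.ndarray, k: int = 1,
               dist=euclidean):
    """Exact brute-force k-NN: the k (distance, index) pairs closest to the query."""
    dists = ((dist(query, series), idx) for idx, series in enumerate(collection))
    return heapq.nsmallest(k, dists)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    collection = rng.standard_normal((1000, 128))        # 1000 series of length 128
    query = rng.standard_normal(128)
    print(knn_search(query, collection, k=5))             # Euclidean k-NN
    print(knn_search(query, collection, k=5, dist=dtw))   # DTW k-NN (slow)
```

The exact search above is exactly what becomes too slow on large collections; the paper's contribution is to return progressively better approximate answers, with probabilistic quality guarantees, long before such an exact scan (or an exact index traversal) completes.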