Data series similarity search is an important operation and at the core of several analysis tasks and applications related to data series collections. Despite the fact that data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose ULISSE, the first data series index structure designed for answering similarity search queries of variable length (within some range). Our contribution is two-fold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different length. Based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk based index visits and in-memory sequential scans. Our approach supports non Z-normalized and Z-normalized sequences, and can be used with no changes with both Euclidean Distance and Dynamic Time Warping, for answering both k-NN and epsilon-range queries. We experimentally evaluate our approach using several synthetic and real datasets. The results show that ULISSE is several times, and up to orders of magnitude more efficient in terms of both space and time cost, when compared to competing approaches. (Paper published in VLDBJ 2020)
翻译:数据序列相似性搜索是一项重要操作,是数据收集系列相关若干分析任务和应用的核心。尽管数据序列指数能够快速相似性搜索,但所有现有指数只能回答单长的查询(在指数构建时固定),这是一个严格的限制。在这项工作中,我们提议ULISSE,这是第一个数据序列索引结构,用于回答相近性查询(在一定范围内),长度不一的查询。我们的贡献是双重的。首先,我们采用了一种新型表述技术,它有效和简洁地概括了不同长度的多个序列。根据拟议的指数,我们描述了近似性和精确相似性搜索的有效算法,将基于磁盘的索引访问和模拟序列扫描合并在一起。我们的方法支持非Z调整和Z调整的序列,并且可以在不作任何改动的情况下使用该数据序列,用于回答K-NN和Epsilon-范围查询。我们用若干合成和真实的数据集对我们的方法进行了实验性评估。结果显示,ULISSE是近似和精确的算法数倍,在2020年空间水平上,比VL级比较时,比V级的数值。