Accurate and Efficient Time Series Matching by Season- and Trend-aware Symbolic Approximation -- Extended Version Including Additional Evaluation and Proofs

May 31, 2021

Lars Kegel, Claudio Hartmann, Maik Thiele, Wolfgang Lehner

Processing and analyzing time series datasets has become a central issue in many domains, requiring data management systems to support time series as a native data type. A crucial prerequisite of these systems is time series matching, which is still a challenging problem. A time series is a high-dimensional data type: its representation is storage-consuming and its comparison is time-consuming. Among the representation techniques that tackle these challenges, the symbolic aggregate approximation (SAX) is the current state of the art. This technique reduces a time series to a low-dimensional space by segmenting it and discretizing each segment into a symbol from a small alphabet. However, SAX ignores the deterministic behavior of time series, such as cyclically repeating patterns or a trend component, which affects all segments and distorts the symbolic distribution. In this paper, we present a season- and a trend-aware symbolic approximation. We show that this improves the symbolic distribution and increases the representation accuracy without increasing the memory footprint. Most importantly, it enables more efficient time series matching by providing a match up to three orders of magnitude faster than SAX.
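
To make the segment-and-discretize step concrete, here is a minimal sketch of classic SAX as it is commonly described (z-normalization, segment means, equiprobable Gaussian breakpoints). It is not the season- or trend-aware variant presented in the paper, and the function name `sax`, its parameters, and the default alphabet are assumptions chosen for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Illustrative classic SAX: returns one symbol per segment.

    Hypothetical helper, not the paper's implementation. Assumes the series
    length is a multiple of n_segments to keep the segmentation trivial.
    """
    if n_segments <= 0 or len(series) % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")

    # 1. Z-normalize so that values are roughly standard-normal distributed.
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # 2. Segment the series and keep the mean of each segment.
    seg_len = len(series) // n_segments
    segment_means = [
        mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)
    ]

    # 3. Discretize each mean using breakpoints that split the standard
    #    normal distribution into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, m)] for m in segment_means)


if __name__ == "__main__":
    print(sax([1, 1, 2, 2, 3, 3, 4, 4], n_segments=4))
```

On a monotonically increasing input such as the one above, the symbol of each segment is determined by the trend alone, which is the kind of deterministic behavior the abstract says plain SAX does not account for.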
