Due to the importance of lower-bounding distances and the attractiveness of symbolic representations, the family of symbolic aggregate approximation (SAX) methods has been used extensively for encoding time series data. However, typical SAX-based methods rely on two restrictive assumptions: Gaussian-distributed data and equiprobable symbols. This paper proposes two novel data-driven SAX-based symbolic representations, distinguished by their discretization steps. The first representation, oriented toward general data compaction and indexing scenarios, combines kernel density estimation with Lloyd-Max quantization to minimize the information loss and mean squared error in the discretization step. The second, oriented toward high-level mining tasks, employs Mean-Shift clustering and is shown to enhance anomaly detection in the lower-dimensional space. In addition, we verify on a theoretical basis a previously observed phenomenon: the intermediate piecewise aggregate approximation (PAA) step yields a variance lower than expected. This phenomenon causes additional information loss but can be avoided with a simple modification. The proposed representations possess all the attractive properties of the conventional SAX method. Furthermore, experimental evaluation on real-world datasets demonstrates their superiority over the traditional SAX and an alternative data-driven SAX variant.
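To make the contrast between the conventional and the data-driven discretization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a z-normalized series, a segment count that divides the series length, and a Lloyd-Max quantizer fitted directly to the empirical PAA values rather than to a kernel density estimate (a simplification of the first proposed representation). All function names and the toy data are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, fmean, pvariance


def znorm(series):
    """Z-normalize a series to zero mean and unit variance."""
    mu = fmean(series)
    sigma = pvariance(series, mu) ** 0.5
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: mean of each equal-length segment.

    Assumes len(series) is divisible by n_segments, to keep the sketch short.
    """
    seg = len(series) // n_segments
    return [fmean(series[i * seg:(i + 1) * seg]) for i in range(n_segments)]


def gaussian_breakpoints(alphabet_size):
    """Conventional SAX breakpoints: equiprobable bins under N(0, 1)."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def lloyd_max_breakpoints(values, alphabet_size, n_iter=50):
    """Lloyd-Max quantizer fitted to the empirical sample (no KDE step)."""
    data = sorted(values)
    # Initialize the levels at evenly spaced sample quantiles.
    levels = [data[(2 * k + 1) * len(data) // (2 * alphabet_size)]
              for k in range(alphabet_size)]
    for _ in range(n_iter):
        # Decision thresholds are the midpoints between neighbouring levels.
        cuts = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in range(alphabet_size)]
        for v in data:
            cells[bisect_right(cuts, v)].append(v)
        # Each level moves to the centroid of its cell (kept if the cell is empty).
        levels = [fmean(c) if c else lv for c, lv in zip(cells, levels)]
    return [(a + b) / 2 for a, b in zip(levels, levels[1:])]


def symbolize(values, breakpoints):
    """Map each value to the index of the interval it falls in."""
    return [bisect_right(breakpoints, v) for v in values]


random.seed(0)
# Hypothetical skewed toy series (exponential noise), not data from the paper.
series = znorm([random.expovariate(1.0) for _ in range(4096)])
segments = paa(series, 512)

sax_word = symbolize(segments, gaussian_breakpoints(4))
lloyd_cuts = lloyd_max_breakpoints(segments, 4)
data_word = symbolize(segments, lloyd_cuts)

# Averaging within segments shrinks the variance below the unit variance
# that the N(0, 1) breakpoints assume.
paa_variance = pvariance(segments)
```

On this toy input, `paa_variance` falls well below 1 even though the series itself was normalized to unit variance, which illustrates the variance reduction that the abstract refers to. The Gaussian breakpoints ignore that shrinkage, whereas the quantizer fitted to the PAA values adapts to it.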