The matrix profile is an effective data mining tool that provides similarity join functionality for time series data. Users of the matrix profile can either join a time series with itself using an intra-similarity join (i.e., a self-join) or join a time series with another time series using an inter-similarity join. By invoking either or both types of join, the matrix profile can help users discover both conserved and anomalous structures in the data. Since the introduction of the matrix profile five years ago, multiple efforts have been made to speed up the computation with approximate joins; however, the majority of these efforts focus only on self-joins. In this work, we show that it is possible to efficiently perform approximate inter-time series similarity joins with error-bounded guarantees by creating a compact "dictionary" representation of a time series. Using the dictionary representation instead of the original time series, we are able to improve the throughput of an anomaly mining system by at least 20X, with essentially no decrease in accuracy. As a side effect, the dictionaries also summarize the time series in a semantically meaningful way and can provide intuitive and actionable insights. We demonstrate the utility of our dictionary-based inter-time series similarity joins on domains as diverse as medicine and transportation.
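
To make the notion of an inter-time series similarity join concrete, the following is a minimal brute-force sketch, not the dictionary-based method proposed in this work. It assumes the z-normalized Euclidean distance commonly used with the matrix profile; the names `znorm` and `ab_join` are hypothetical and chosen for illustration only. For every length-`m` subsequence of a query series `a`, it returns the distance to its nearest neighbor in a reference series `b`; large values flag subsequences of `a` that have no close match in `b`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def znorm(x, eps=1e-12):
    """Z-normalize one subsequence; constant subsequences map to zeros."""
    sd = x.std()
    if sd < eps:
        return np.zeros_like(x)
    return (x - x.mean()) / sd


def ab_join(a, b, m):
    """Brute-force inter-similarity join (illustrative, quadratic cost).

    Returns (profile, index): profile[i] is the z-normalized Euclidean
    distance from a[i:i+m] to its nearest neighbor among the length-m
    subsequences of b, and index[i] is where that neighbor starts in b.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    za = np.array([znorm(w) for w in sliding_window_view(a, m)])
    zb = np.array([znorm(w) for w in sliding_window_view(b, m)])
    # Pairwise squared distances via ||x||^2 + ||y||^2 - 2 x.y
    d2 = (za ** 2).sum(axis=1)[:, None] + (zb ** 2).sum(axis=1)[None, :] - 2.0 * za @ zb.T
    dist = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding
    return dist.min(axis=1), dist.argmin(axis=1)


# Synthetic example: b is a clean sine wave, a is a copy with an injected bump.
t = np.arange(400)
b = np.sin(2 * np.pi * t / 50)
a = b[:200].copy()
a[100:110] += 2.0
profile, index = ab_join(a, b, m=25)
print(int(profile.argmax()))  # start of the subsequence of a least similar to b
```

The cost of this sketch grows with the product of the two series lengths, which is why replacing `b` with a much shorter representative series, as the dictionary approach does, can raise throughput.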