Subsequence matching has proven to be a suitable approach for many problems in data mining and similarity retrieval. It has been shown that almost any class of data (audio, image, biometrics, signals) is, or can be, represented by some kind of time series or string of symbols, which can serve as input for various subsequence matching approaches. The variety of data types, specific tasks, and their partial or full solutions is so wide that choosing, implementing, and parametrizing a suitable solution for a given task can be complicated and time-consuming; a potentially fruitful combination of fragments from different research areas may be neither obvious nor easy to realize. Leading authors in this field also mention the implementation bias that makes a proper comparison of competing approaches difficult. We therefore present a new generic Subsequence Matching Framework (SMF) that tries to overcome these problems with a uniform frame that simplifies and speeds up the design, development, and evaluation of subsequence matching systems. We identify several relatively separate subtasks that are solved differently across the literature, and SMF makes it possible to combine them in a straightforward manner, achieving new quality and efficiency. The framework can be used in many application domains, and its components can be reused effectively. Its strictly modular architecture and openness also allow the integration of efficient solutions from different fields, for instance efficient metric-based indexes. This is an extended version of a paper published at DEXA 2012.