Event sequences are attracting growing attention. In particular, clustering of event sequences is widely applicable in domains such as healthcare, marketing, and finance, with use cases including the analysis of visitors to websites and hospitals and of bank transactions. Unlike traditional time series, event sequences tend to be sparse and unevenly spaced in time. As a result, they exhibit different properties, which must be accounted for when developing state-of-the-art methods. The community has paid little attention to the specifics of heterogeneous event sequences: existing clustering research focuses primarily on classic time series data, and it is unclear whether the methods proposed in the literature generalize well to event sequences. Here we propose COHORTNEY, a novel deep learning method for clustering heterogeneous event sequences. Our contributions include (i) a novel method that combines an LSTM with the EM algorithm, together with a code implementation; (ii) a comparison of this method with previous research on time series and event sequence clustering; and (iii) a performance benchmark of different approaches on a new dataset from the finance industry and fourteen additional datasets. Our results show that COHORTNEY vastly outperforms the state-of-the-art algorithm for clustering event sequences in both speed and cluster quality.