Process mining aims to gain knowledge of business processes by discovering process models from event logs generated by information systems. The insights it yields depend heavily on the quality of those event logs. Activities extracted from different data sources, or entered as free text within the same system, may carry inconsistent labels. Such inconsistency leads to redundant activity labels, that is, labels that differ in syntax but share the same behaviour. Redundant activity labels introduce unnecessary complexity into event logs. Identifying these labels in data-driven process discovery is difficult and relies heavily on human intervention. Neither existing process discovery algorithms nor event data preprocessing techniques can resolve such redundancy efficiently. In this paper, we propose a multi-view approach that automatically detects redundant activity labels using not only context-aware features, such as control-flow relations and attribute values, but also semantic features from the event logs. Our evaluation on several publicly available datasets and a real-life case study demonstrates that the approach can efficiently detect redundant activity labels, even those with low occurrence frequencies. The proposed approach can add value to the preprocessing step by generating more representative event logs.
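To make the idea of combining views concrete, the following is a minimal illustrative sketch and not the method proposed in the paper. It assumes an event log reduced to a list of traces (each a list of activity labels), uses the sets of direct predecessors and successors as a simple control-flow view, and substitutes a character-level string similarity from Python's standard library for a real semantic view. The function names, the equal weighting, and the threshold of 0.7 are all hypothetical choices made for illustration; attribute values are not modelled.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def context_sets(traces):
    """Map each activity label to its sets of direct predecessors and successors."""
    pred, succ = defaultdict(set), defaultdict(set)
    for trace in traces:
        # Artificial start/end markers give the first and last events a context too.
        padded = ["<start>"] + list(trace) + ["<end>"]
        for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
            pred[cur].add(prev)
            succ[cur].add(nxt)
    return pred, succ


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundant_label_candidates(traces, w_context=0.5, threshold=0.7):
    """Return label pairs whose combined similarity reaches the threshold.

    Assumption: w_context and threshold are illustrative values, not tuned ones.
    """
    pred, succ = context_sets(traces)
    labels = sorted(pred)
    candidates = []
    for a, b in combinations(labels, 2):
        # Control-flow view: how similar are the neighbouring activities?
        context = 0.5 * (jaccard(pred[a], pred[b]) + jaccard(succ[a], succ[b]))
        # Stand-in for the semantic view: plain string similarity of the labels.
        text = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        score = w_context * context + (1 - w_context) * text
        if score >= threshold:
            candidates.append((a, b, round(score, 3)))
    return sorted(candidates, key=lambda c: -c[2])


if __name__ == "__main__":
    toy_log = [
        ["Register", "Check invoice", "Pay"],
        ["Register", "check-invoice", "Pay"],
        ["Register", "Check invoice", "Reject"],
    ]
    for a, b, score in redundant_label_candidates(toy_log):
        print(a, "~", b, score)
```

On this toy log, the only pair flagged is "Check invoice" and "check-invoice". Because the context view is built from sets rather than counts, a label that occurs only once still receives a full context profile, which is one simple way to remain sensitive to low-frequency labels.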