The insights that process mining yields depend heavily on the quality of the event logs. Activities extracted from healthcare information systems are often recorded as free text, which can lead to inconsistent labels. Such inconsistency in turn produces redundant activity labels, that is, labels that differ in syntax but share the same behaviour. Identifying these labels through data-driven process discovery is difficult and relies heavily on resource-intensive human review. Existing work achieves low accuracy when redundant activity labels occur infrequently or when event logs contain numerical data values as attributes. Both phenomena, however, are commonly observed in healthcare information systems. In this paper, we propose an approach to detect redundant activity labels using control-flow relations and numerical data values from event logs. Natural Language Processing is also integrated into our method to assess semantic similarity between labels, which provides users with additional insights. We evaluated our approach on synthetic logs generated from the real-life Sepsis log and in a case study using the MIMIC-III data set. The results demonstrate that our approach can successfully detect redundant activity labels. This approach can add value to the preprocessing step by generating more representative event logs for process mining tasks in the healthcare domain.
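To make the idea of comparing labels by their control-flow relations concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes that two labels are candidates for redundancy when they share similar sets of direct predecessors and successors, scored with Jaccard similarity. The function names, the 0.8 threshold, and the toy log are all hypothetical, and numerical attributes and semantic similarity are deliberately left out.

```python
from collections import defaultdict


def control_flow_context(traces):
    """Map each activity label to its sets of direct predecessors and successors."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            succs[current].add(following)
            preds[following].add(current)
    labels = {activity for trace in traces for activity in trace}
    return {label: (preds[label], succs[label]) for label in labels}


def jaccard(x, y):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def candidate_redundant_pairs(traces, threshold=0.8):
    """Return label pairs whose averaged predecessor/successor similarity
    reaches the threshold (an illustrative value, not taken from the paper)."""
    context = control_flow_context(traces)
    labels = sorted(context)
    pairs = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            preds_a, succs_a = context[a]
            preds_b, succs_b = context[b]
            score = (jaccard(preds_a, preds_b) + jaccard(succs_a, succs_b)) / 2
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Hypothetical toy log: each trace is one case, as an ordered list of labels.
toy_log = [
    ["Register", "Blood test", "Discharge"],
    ["Register", "blood-test", "Discharge"],
    ["Register", "X-ray", "Review", "Discharge"],
]

print(candidate_redundant_pairs(toy_log))
```

On this toy log, "Blood test" and "blood-test" are the only pair flagged, because both are directly preceded by "Register" and directly followed by "Discharge". A purely control-flow heuristic like this would also flag genuinely different activities that happen to occupy the same position in the process, which is one reason to combine it with further evidence such as attribute values or label semantics.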