Encoding methods are employed across several process mining tasks, including predictive process monitoring, anomalous case detection, and trace clustering. These methods are usually applied as preprocessing steps and are responsible for transforming complex information into a numerical feature space. Most papers choose existing encoding methods arbitrarily or follow a strategy based on domain-specific expert knowledge. Moreover, existing methods are typically used with their default hyperparameters, without evaluating other options. This practice can lead to several drawbacks, such as suboptimal performance and unfair comparisons with the state of the art. Therefore, this work aims to provide a comprehensive survey on event log encoding by comparing 27 methods of different natures in terms of expressivity, scalability, correlation, and domain agnosticism. To the best of our knowledge, this is the most comprehensive study so far focusing on trace encoding in process mining. It contributes to raising awareness of the role of trace encoding in process mining pipelines and sheds light on issues, concerns, and future research directions regarding the use of encoding methods to bridge the gap between machine learning models and process mining.