Explaining the decisions of machine learning (ML) models is becoming a necessity in many areas where trust in a model's decisions is key to its accreditation and adoption. The ability to explain a model's decisions also makes it possible to provide a diagnosis alongside the decision itself, which is highly valuable in scenarios such as fault detection. Unfortunately, high-performance models do not exhibit the transparency needed to make their decisions fully understandable, and the black-box approaches used to explain such models lack the accuracy to trace back the exact cause of a decision for a given input. Indeed, they cannot explicitly describe the model's decision regions around that input, which is necessary to determine what influences the model towards one decision or another. We therefore asked ourselves the following question: is there a category of high-performance models among those currently in use whose decision regions in the input feature space can be explicitly and exactly characterised geometrically? Surprisingly, the answer is positive for any model in the category of tree ensembles, which encompasses a wide range of high-performance models such as XGBoost, LightGBM and random forests. We derive an exact geometrical characterisation of their decision regions in the form of a collection of multidimensional intervals. This characterisation makes it straightforward to compute the optimal counterfactual (CF) example associated with a query point. We demonstrate several capabilities of the approach, such as computing the CF example from only a subset of features, which yields more plausible explanations by incorporating prior knowledge about which variables the user can control. An adaptation of CF reasoning to regression problems is also envisaged.
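The core geometric idea can be illustrated on the simplest member of the tree-ensemble family, a single decision tree: every leaf corresponds to an axis-aligned box (a multidimensional interval) in feature space, and the union of same-class boxes is an exact description of a decision region. The optimal counterfactual of a query point is then its projection onto the nearest box whose predicted class differs. The sketch below is a minimal illustration of this principle using scikit-learn, not the paper's actual algorithm; all function and variable names are ours, and the extension to full ensembles (intersections of boxes across trees) is omitted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def leaf_boxes(clf, n_features):
    """Return (lower, upper, class) box for every leaf of a fitted tree."""
    t = clf.tree_
    boxes = []
    def recurse(node, lo, hi):
        if t.children_left[node] == -1:  # -1 marks a leaf in sklearn trees
            boxes.append((lo.copy(), hi.copy(), int(np.argmax(t.value[node]))))
            return
        f, thr = t.feature[node], t.threshold[node]
        hi_left = hi.copy(); hi_left[f] = min(hi[f], thr)   # x[f] <= thr branch
        recurse(t.children_left[node], lo.copy(), hi_left)
        lo_right = lo.copy(); lo_right[f] = max(lo[f], thr)  # x[f] > thr branch
        recurse(t.children_right[node], lo_right, hi.copy())
    recurse(0, np.full(n_features, -np.inf), np.full(n_features, np.inf))
    return boxes

def optimal_counterfactual(x, boxes, current_class):
    """Project x onto the closest box predicting a different class.

    The projection lands on the closed box, i.e. possibly on a split
    threshold; it realises the infimum of the distance to the region.
    """
    best, best_d = None, np.inf
    for lo, hi, cls in boxes:
        if cls == current_class:
            continue
        proj = np.clip(x, lo, hi)        # Euclidean projection onto the box
        d = np.linalg.norm(proj - x)
        if d < best_d:
            best, best_d = proj, d
    return best, best_d

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
boxes = leaf_boxes(clf, X.shape[1])
x = X[0]
cf, dist = optimal_counterfactual(x, boxes, clf.predict([x])[0])
```

Restricting the search to counterfactuals that change only user-controllable features, as mentioned above, amounts to projecting onto each box while clamping the immutable coordinates to their values in `x`.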