Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is generally not trivial to understand why a given sample (record) is labelled as an anomaly, and thus to diagnose its root causes. We propose the following reduced-dimensionality, surrogate-model approach to explaining detector decisions: approximate the detection model with another model that employs only a small subset of the features. Samples can then be visualized in this low-dimensional space for human understanding. To this end, we develop PROTEUS, an AutoML pipeline that produces the surrogate model and is specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can explain not only the training data but also out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is also designed to return an accurate estimate of out-of-sample predictive performance, which serves as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS in producing predictive explanations for different families of detectors and in reliably estimating their predictive performance on unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.
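The surrogate idea described above can be illustrated with a minimal sketch. This is not the PROTEUS pipeline itself: the detector (squared distance from the origin), the surrogate (a nearest-centroid classifier), the greedy forward selection, the synthetic data, and all names such as `detector_score`, `fit_surrogate`, `N_FEATURES` and `K` are assumptions chosen for illustration, and only the Python standard library is used so that the example is self-contained. It labels data with the detector, selects a small feature subset on a validation split, and reports held-out AUC as a rough measure of how well the surrogate approximates the detector.

```python
import random

N_FEATURES = 10   # features seen by the detector (assumed for this toy example)
K = 2             # features the surrogate may use (small enough to plot)

def make_data(n, rng):
    """Gaussian noise; every 20th row is shifted in feature 0 or feature 1."""
    rows = []
    for i in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        if i % 20 == 0:
            x[(i // 20) % 2] += 8.0   # alternate between the two planted features
        rows.append(x)
    return rows

def detector_score(x):
    """Stand-in unsupervised detector: squared distance from the origin."""
    return sum(v * v for v in x)

def fit_surrogate(rows, labels, feats):
    """Nearest-centroid classifier restricted to the features in `feats`."""
    def centroid(cls):
        pts = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(p[j] for p in pts) / len(pts) for j in feats]
    c_norm, c_anom = centroid(0), centroid(1)
    def score(x):
        d_norm = sum((x[j] - c) ** 2 for j, c in zip(feats, c_norm))
        d_anom = sum((x[j] - c) ** 2 for j, c in zip(feats, c_anom))
        return d_norm - d_anom   # larger means closer to the anomaly centroid
    return score

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def evaluate(feats, fit_rows, fit_labels, held_rows, held_labels):
    """Fit the surrogate on one split and score it on another."""
    score = fit_surrogate(fit_rows, fit_labels, feats)
    return auc([score(x) for x in held_rows], held_labels)

rng = random.Random(0)
fit, val, test = make_data(400, rng), make_data(400, rng), make_data(400, rng)

# The detector's own decisions are the labels the surrogate must reproduce.
# The top 5% of detector scores on the fit split are flagged as anomalies.
threshold = sorted(detector_score(x) for x in fit)[len(fit) * 95 // 100]

def detector_labels(rows):
    return [int(detector_score(x) >= threshold) for x in rows]

fit_y, val_y, test_y = detector_labels(fit), detector_labels(val), detector_labels(test)

# Greedy forward selection of K features, scored on the validation split.
selected = []
while len(selected) < K:
    candidates = [j for j in range(N_FEATURES) if j not in selected]
    best = max(candidates,
               key=lambda j: evaluate(selected + [j], fit, fit_y, val, val_y))
    selected.append(best)

# Held-out AUC: how faithfully the surrogate mimics the detector on unseen data.
test_auc = evaluate(selected, fit, fit_y, test, test_y)
print("selected features:", sorted(selected), "held-out AUC:", round(test_auc, 3))
```

Because the test split is used neither for fitting nor for feature selection, its AUC is an out-of-sample estimate of the approximation quality. A single hold-out split is used here only for brevity; it is a simplification and not a description of how PROTEUS estimates performance. With two selected features, each sample can be drawn as a point in a 2-D scatter plot coloured by the detector's decision.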