The ability to collect and store ever larger databases has been accompanied by the need to process them efficiently. In many cases, most observations behave similarly, while a presumably small proportion of them are abnormal. Detecting the latter, referred to as outliers, is one of the major challenges for machine learning applications (e.g. in fraud detection or predictive maintenance). In this paper, we propose a methodology for outlier detection that learns a data-driven scoring function, defined on the feature space, which reflects the degree of abnormality of the observations. This scoring function is learnt by solving a well-designed binary classification problem whose empirical criterion takes the form of a two-sample linear rank statistic, for which theoretical results are available. We illustrate our methodology with encouraging preliminary numerical experiments.
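To make the notion of a two-sample linear rank statistic concrete, here is a minimal sketch of its generic form: the values of a scoring function on two samples are pooled and ranked, and a score-generating function is summed over the normalized ranks of one sample. This is not the paper's implementation. The function name `linear_rank_statistic`, the argument names, and the default choice `phi(u) = u` (which gives a rescaled Wilcoxon rank-sum statistic) are assumptions made for illustration, and the abstract does not specify how the two samples of the classification problem are constructed.

```python
def linear_rank_statistic(scores_a, scores_b, phi=lambda u: u):
    """Generic two-sample linear rank statistic (illustrative sketch).

    scores_a, scores_b: scoring-function values on the two samples
    (hypothetical names; the paper's construction may differ).
    phi: score-generating function applied to the normalized ranks.
    """
    pooled = list(scores_a) + list(scores_b)
    n_total = len(pooled)

    # Rank 1 goes to the smallest pooled score. Ties are broken by
    # position, which is a simplification rather than mid-rank handling.
    order = sorted(range(n_total), key=lambda i: pooled[i])
    ranks = [0] * n_total
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank

    # Sum phi over the normalized ranks of the first sample only.
    return sum(phi(ranks[i] / (n_total + 1)) for i in range(len(scores_a)))


# The statistic is larger when the first sample tends to receive higher scores.
high = linear_rank_statistic([0.9, 0.8], [0.1, 0.2, 0.3])
low = linear_rank_statistic([0.1, 0.2], [0.3, 0.8, 0.9])
print(high > low)
```

In this sketch a scoring function that ranks the first sample above the second yields a larger value, which is the sense in which such a statistic can serve as an empirical criterion for comparing scoring functions.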