Given an unsupervised outlier detection (OD) task on a new dataset, how can we automatically select a good outlier detection method and its hyperparameter(s) (collectively called a model)? Thus far, model selection for OD has been a "black art," as any model evaluation is infeasible due to the lack of (i) hold-out data with labels, and (ii) a universal objective function. In this work, we develop the first principled data-driven approach to model selection for OD, called MetaOD, based on meta-learning. MetaOD capitalizes on the past performance of a large body of detection models on existing outlier detection benchmark datasets, and carries over this prior experience to automatically select an effective model to be employed on a new dataset without using any labels. To capture task similarity, we introduce specialized meta-features that quantify outlying characteristics of a dataset. Through comprehensive experiments, we show the effectiveness of MetaOD in selecting a detection model that significantly outperforms the most popular outlier detectors (e.g., LOF and iForest) as well as various state-of-the-art unsupervised meta-learners, while being extremely fast. To foster reproducibility and further research on this new problem, we open-source our entire meta-learning system, benchmark environment, and testbed datasets.
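To make the transfer-of-prior-experience idea concrete, below is a minimal Python sketch of meta-learned model selection. All names here are illustrative assumptions rather than the actual MetaOD API: `perf` is a hypothetical (benchmark datasets × candidate models) matrix of historical detection performance, `meta_train` holds meta-feature vectors of those benchmarks, and `extract_meta_features` stands in for the paper's specialized outlying-characteristic meta-features. The sketch uses simple nearest-neighbor averaging for task similarity; MetaOD itself learns the mapping from meta-features to model performance rather than relying on this plain heuristic.

```python
import numpy as np

def extract_meta_features(X):
    # Hypothetical stand-in for MetaOD's specialized meta-features:
    # simple statistical summaries of the (unlabeled) dataset.
    return np.array([X.shape[0], X.shape[1], X.mean(), X.std(),
                     np.abs(X - X.mean(axis=0)).max()])

def select_model(X_new, meta_train, perf, k=5):
    """Pick the model with the best average past performance on the
    k benchmark datasets most similar to the new, unlabeled dataset."""
    f = extract_meta_features(X_new)
    # Euclidean distance in meta-feature space as a task-similarity proxy.
    dists = np.linalg.norm(meta_train - f, axis=1)
    neighbors = np.argsort(dists)[:k]
    # Transfer prior experience: average each model's performance over
    # the most similar tasks, then return the best model's index.
    return int(perf[neighbors].mean(axis=0).argmax())
```

Note that no labels from the new dataset are used anywhere: the only supervision comes from performance records on past benchmarks, which is the core premise of the approach.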