One-class classification is a challenging subfield of machine learning in which so-called data descriptors are used to predict membership of a class based solely on positive examples of that class, without any counter-examples. A number of data descriptors that have been shown to perform well in previous studies of one-class classification, like the Support Vector Machine (SVM), require setting one or more hyperparameters. There has been no systematic attempt to date to determine optimal default values for these hyperparameters, which limits their ease of use, especially in comparison with hyperparameter-free proposals like the Isolation Forest (IF). We address this issue by determining optimal default hyperparameter values across a collection of 246 one-class classification problems derived from 50 different real-world datasets. In addition, we propose a new data descriptor, Average Localised Proximity (ALP), to address certain issues with existing approaches based on nearest neighbour distances. Finally, we evaluate classification performance using a leave-one-dataset-out procedure, and find strong evidence that ALP outperforms IF and a number of other data descriptors, as well as weak evidence that it outperforms SVM, making ALP a good default choice.
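To make the leave-one-dataset-out protocol mentioned above concrete, the sketch below shows one plausible way to implement it, assuming scikit-learn's `OneClassSVM` as a stand-in for the SVM data descriptor. The mapping `problems_by_dataset`, the hyperparameter grids, and the helper names are hypothetical, not the authors' code: default hyperparameters are selected on all datasets except one, and that held-out dataset is used only for evaluation.

```python
# Minimal sketch (not the paper's implementation) of leave-one-dataset-out
# default hyperparameter selection for a one-class data descriptor.
# `problems_by_dataset` is a hypothetical dict: dataset name -> list of
# (X_train, X_test, y_test) one-class problems derived from that dataset.

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score


def mean_auroc(nu, gamma, problems):
    """Mean AUROC of a one-class SVM with the given hyperparameters."""
    scores = []
    for X_train, X_test, y_test in problems:
        clf = OneClassSVM(nu=nu, gamma=gamma).fit(X_train)
        scores.append(roc_auc_score(y_test, clf.decision_function(X_test)))
    return np.mean(scores)


def leave_one_dataset_out(problems_by_dataset, nu_grid, gamma_grid):
    """For each dataset, pick defaults on the others, then score it."""
    results = {}
    for held_out in problems_by_dataset:
        # Select default hyperparameters on all *other* datasets...
        other = [p for name, ps in problems_by_dataset.items()
                 if name != held_out for p in ps]
        best = max(((nu, g) for nu in nu_grid for g in gamma_grid),
                   key=lambda hp: mean_auroc(*hp, other))
        # ...then evaluate those defaults on the held-out dataset only.
        results[held_out] = mean_auroc(*best, problems_by_dataset[held_out])
    return results
```

Because every dataset's score is produced with hyperparameters chosen without access to that dataset, the resulting per-dataset AUROCs estimate how well the recommended defaults transfer to genuinely unseen problems.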