Generative Adversarial Active Learning for Unsupervised Outlier Detection

Outlier detection is an important topic in machine learning and has been used in a wide range of applications. In this paper, we approach outlier detection as a binary classification problem by sampling potential outliers from a uniform reference distribution. However, due to the sparsity of data in high-dimensional space, a limited number of potential outliers may fail to provide sufficient information to assist the classifier in describing a boundary that can separate outliers from normal data effectively. To address this, we propose a novel Single-Objective Generative Adversarial Active Learning (SO-GAAL) method for outlier detection, which can directly generate informative potential outliers based on the mini-max game between a generator and a discriminator. Moreover, to prevent the generator from falling into the mode collapse problem, training should be stopped once SO-GAAL is able to provide sufficient information. Without any prior information, however, determining this stopping point is extremely difficult for SO-GAAL. Therefore, we expand the network structure of SO-GAAL from a single generator to multiple generators with different objectives (MO-GAAL), which can generate a reasonable reference distribution for the whole dataset. We empirically compare the proposed approach with several state-of-the-art outlier detection methods on both synthetic and real-world datasets. The results show that MO-GAAL outperforms its competitors in the majority of cases, especially for datasets with various cluster types or a high ratio of irrelevant variables.
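The starting point described above, treating outlier detection as binary classification against a uniform reference distribution, can be illustrated with a small sketch. This is not the authors' implementation and it does not cover SO-GAAL or MO-GAAL: a k-nearest-neighbour vote stands in for the neural-network classifier, the reference samples are drawn from the bounding box of the data, and the names `sample_uniform_reference`, `outlier_scores`, and the parameter `k` are hypothetical choices made for illustration.

```python
import math
import random


def sample_uniform_reference(data, n_samples, rng):
    """Draw points uniformly from the axis-aligned bounding box of `data`.

    Assumption: the bounding box is used as the support of the uniform
    reference distribution.
    """
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(n_samples)
    ]


def outlier_scores(data, reference, k=5):
    """Score each data point by the fraction of its k nearest neighbours
    that are reference (potential-outlier) points.

    Label 0 marks an observed data point, label 1 a reference point.
    The k-NN vote is a stand-in for a trained classifier (assumption).
    """
    labelled = [(p, 0) for p in data] + [(p, 1) for p in reference]
    scores = []
    for i, x in enumerate(data):
        # Rank all other points by distance; skip the query point itself.
        neighbours = sorted(
            (math.dist(x, p), label)
            for j, (p, label) in enumerate(labelled)
            if j != i
        )[:k]
        scores.append(sum(label for _, label in neighbours) / k)
    return scores


# Toy data: one tight Gaussian cluster plus a single distant point.
rng = random.Random(0)
data = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(200)]
data.append((4.0, 4.0))  # the planted outlier

reference = sample_uniform_reference(data, n_samples=200, rng=rng)
scores = outlier_scores(data, reference, k=5)

# Higher score = the point sits in a region dominated by reference samples.
print("planted outlier score:", scores[-1])
print("mean inlier score:", sum(scores[:-1]) / len(scores[:-1]))
```

In two dimensions, 200 reference samples cover the bounding box reasonably well. As the dimension grows, the same number of uniform samples becomes too sparse to outline the data, which is the limitation the abstract gives as the motivation for generating informative potential outliers instead.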
