Anomalies (or outliers) are prevalent in real-world empirical observations and can mask important underlying structures. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To identify anomalies automatically, we propose the Probabilistic Robust AutoEncoder (PRAE). PRAE is designed to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first present the intractable Robust AutoEncoder (RAE) objective as a minimization problem that splits the data into inlier samples, from which a low-dimensional representation is learned via an AutoEncoder (AE), and anomalous (outlier) samples, which are excluded because they do not fit the low-dimensional representation. RAE minimizes the autoencoder's reconstruction error while incorporating as many samples as possible. This can be formulated via regularization by subtracting from the reconstruction term an $\ell_0$ norm that counts the number of selected samples. Unfortunately, this leads to an intractable combinatorial problem. Therefore, we propose two probabilistic relaxations of RAE, which are differentiable and alleviate the need for a combinatorial search. We prove that the solution to the PRAE problem is equivalent to the solution of RAE. We use synthetic data to show that PRAE can accurately remove outliers across a wide range of contamination levels. Finally, we demonstrate that using PRAE for anomaly detection leads to state-of-the-art results on various benchmark datasets.
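To make the selection objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-sample reconstruction errors are already fixed (in the actual method the autoencoder is trained jointly with the selection), and the names `errors`, `lam`, and all function names are hypothetical. It evaluates the masked reconstruction error minus `lam` times the number of selected samples, finds the best binary mask by exhaustive search, and compares it with the expected objective under independent Bernoulli selection probabilities, which is one simple way to view a probabilistic relaxation.

```python
import itertools
import numpy as np

def rae_objective(errors, mask, lam):
    """Masked reconstruction error minus lam times the count of selected samples."""
    mask = np.asarray(mask, dtype=float)
    return float(np.sum(mask * errors) - lam * np.sum(mask))

def brute_force_mask(errors, lam):
    """Exhaustive search over all 2^n binary masks; only feasible for tiny n."""
    n = len(errors)
    best = min(itertools.product([0, 1], repeat=n),
               key=lambda m: rae_objective(errors, m, lam))
    return np.array(best)

def expected_objective(errors, probs, lam):
    """Expected objective when mask_i ~ Bernoulli(probs_i) independently."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs * (errors - lam)))

# Hypothetical reconstruction errors from some fixed autoencoder (assumed values).
errors = np.array([0.1, 0.2, 0.15, 3.0, 0.05, 2.5])
lam = 1.0  # assumed regularization weight

best_mask = brute_force_mask(errors, lam)       # combinatorial solution
threshold_mask = (errors < lam).astype(int)     # closed form for fixed errors
relaxed_value = expected_objective(errors, threshold_mask, lam)
```

In this fixed-error setting the expected objective is linear in the selection probabilities, so its minimum over $[0, 1]^n$ is attained at a binary mask that keeps exactly the samples with error below `lam`. This only illustrates why a relaxed and a combinatorial problem can share a minimizer; it does not reproduce the equivalence proof stated in the abstract.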