Autoencoders are widely used in outlier detection because they handle high-dimensional and nonlinear datasets well. The reconstruction of a dataset by an autoencoder can be viewed as a complex regression process. In regression analysis, outliers are usually divided into high leverage points and influential points. Although autoencoders have shown good results in identifying influential points, they still have problems detecting high leverage points. Through theoretical derivation, we find that most outliers are detected along the direction of the worst-recovered principal component, whereas anomalies along the well-recovered principal components are often overlooked. We propose a new loss function that addresses this deficiency in outlier detection. The core idea of our scheme is twofold. First, to better detect high leverage points, we suppress the complete reconstruction of the dataset so that high leverage points are converted into influential points. Second, we ensure that the difference between each eigenvalue of the covariance matrix of the original dataset and its reconstructed counterpart is equal across the directions of all principal components. We also justify the scheme through rigorous theoretical derivation. Finally, our experiments on multiple datasets confirm that our scheme significantly improves the accuracy of outlier detection.
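The blind spot described above can be illustrated with a small sketch. It is not the proposed loss function. It assumes a rank-1 PCA model as a linear stand-in for an autoencoder, and the synthetic data, the two injected points, and the helper name `pca_reconstruction_errors` are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_reconstruction_errors(X, n_components):
    """Squared reconstruction error of each row of X under a rank-k PCA model.

    Rank-k PCA is used here as a linear stand-in for an autoencoder
    (an assumption of this sketch, not the method of the paper).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]
    # Project onto the retained directions and map back to input space.
    X_hat = Xc @ W.T @ W + mean
    return np.sum((X - X_hat) ** 2, axis=1)


rng = np.random.default_rng(0)

# Synthetic inliers: large variance along axis 0, small variance along axis 1.
n_inliers = 500
inliers = rng.normal(size=(n_inliers, 2)) * np.array([3.0, 0.3])

# Extreme along the well-recovered direction (plays the high leverage role).
leverage_point = np.array([[30.0, 0.0]])
# Extreme along the poorly recovered direction (plays the influential role).
influential_point = np.array([[0.0, 5.0]])

X = np.vstack([inliers, leverage_point, influential_point])
errors = pca_reconstruction_errors(X, n_components=1)

leverage_error = errors[-2]
influential_error = errors[-1]
# Flag a point when its error exceeds the 99th percentile of inlier errors.
threshold = np.quantile(errors[:n_inliers], 0.99)

print(f"threshold:         {threshold:.4f}")
print(f"leverage error:    {leverage_error:.4f}")
print(f"influential error: {influential_error:.4f}")
```

Under these assumptions, the point that is extreme along the retained component is reconstructed almost perfectly and falls below the threshold, while the point that is extreme along the discarded component is flagged. This is the motivation for deliberately limiting reconstruction quality along every principal direction.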