Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised, heterogeneous models (i.e., different algorithms with varying hyperparameters) for further combination and analysis, rather than relying on a single model. How can training, and the scoring of incoming samples by outlyingness (referred to as prediction throughout the paper), be accelerated when a large number of unsupervised, heterogeneous OD models are involved? In this study, we propose a modular acceleration system, called SUOD, to address this question. The proposed system focuses on three complementary acceleration aspects (data reduction for high-dimensional data, approximation for costly models, and taskload imbalance optimization for distributed environments), while maintaining accuracy. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration, and a real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm, is also presented. We open-source SUOD for reproducibility and accessibility.
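To make the taskload imbalance aspect concrete, the following is a minimal sketch of why scheduling matters when heterogeneous models have very different costs. It assumes each model's cost has already been estimated, and it uses a generic greedy longest-processing-time-first heuristic rather than SUOD's own scheduler. The names `balance_tasks` and `worker_loads`, and the cost values, are hypothetical and chosen only for illustration.

```python
import heapq


def balance_tasks(costs, n_workers):
    """Assign tasks to workers greedily, most expensive task first.

    Generic longest-processing-time-first heuristic (an illustrative
    stand-in, not SUOD's actual scheduling method).
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Min-heap of (current_load, worker_index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(n_workers)]
    assignment = [[] for _ in range(n_workers)]
    # Visit tasks in order of decreasing estimated cost.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for task in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + costs[task], worker))
    return assignment


def worker_loads(costs, assignment):
    """Total estimated cost assigned to each worker."""
    return [sum(costs[i] for i in tasks) for tasks in assignment]


# Hypothetical estimated costs: two expensive models and six cheap ones.
costs = [9.0, 8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Naive scheduling: split the model list into contiguous, equal-sized chunks.
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]
balanced = balance_tasks(costs, n_workers=2)

print("naive loads:   ", worker_loads(costs, naive))
print("balanced loads:", worker_loads(costs, balanced))
```

In this toy setting the naive split puts both expensive models on the same worker, so the slowest worker determines the total runtime. Spreading the expensive models across workers shortens that critical path without changing which models are trained.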