SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection

Yue Zhao, Xiyang Hu, Cheng Cheng, Cong Wang, Changlin Wan, Wen Wang, Jianing Yang, Haoping Bai, Zheng Li, Cao Xiao, Yunlong Wang, Zhi Qiao, Jimeng Sun, Leman Akoglu

From arXiv. Proceedings of the 4th Conference on Machine Learning and Systems (MLSys). arXiv admin note: text overlap with arXiv:2002.03222

Outlier detection (OD) is a key data mining task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection. Because ground truth labels are usually unavailable, practitioners often cannot rely on a single model and instead build a large number of heterogeneous unsupervised models (i.e., different algorithms and hyperparameters), which are then combined and analyzed with ensemble learning. However, this approach scales poorly on large, high-dimensional datasets. How can training and prediction be accelerated for a large number of heterogeneous unsupervised OD models? How can we ensure that the acceleration does not degrade the models' detection accuracy? How can the acceleration serve both a single-worker setting and a distributed system with multiple workers? In this study, we propose a three-module acceleration system called SUOD (scalable unsupervised outlier detection) to address these questions. It accelerates three complementary aspects (dimensionality reduction for high-dimensional data, model approximation for complex models, and execution efficiency improvement for task-load imbalance within distributed systems) while controlling the degradation of detection performance. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration. At the time of submission, the released open-source system had been downloaded more than 700,000 times. A real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm, is also provided.
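To make the task-load imbalance problem concrete, the following is a minimal sketch of why the way heterogeneous models are assigned to workers matters. It is not SUOD's scheduler: the abstract does not say how balancing is done, so the sketch swaps in the generic longest-processing-time greedy heuristic. The model names, the cost values, and the helpers `balance_tasks` and `naive_split` are all hypothetical, and the sketch assumes an estimated cost per model is already available.

```python
import heapq


def balance_tasks(costs, n_workers):
    """Greedy longest-processing-time assignment (a generic heuristic,
    used here only as an illustration; not taken from the SUOD paper).

    costs: dict mapping a hypothetical model name to an estimated cost.
    Returns (per-worker total load, per-worker list of model names).
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Min-heap of (current load, worker index): least-loaded worker on top.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    # Place the most expensive models first; break ties by name.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        assignment[w].append(name)
        heapq.heappush(heap, (load + cost, w))
    loads = [0.0] * n_workers
    for load, w in heap:
        loads[w] = load
    return loads, assignment


def naive_split(costs, n_workers):
    """Baseline: equal-sized contiguous chunks that ignore model cost."""
    names = list(costs)
    chunk = -(-len(names) // n_workers)  # ceiling division
    assignment = [names[i * chunk:(i + 1) * chunk] for i in range(n_workers)]
    loads = [sum(costs[n] for n in part) for part in assignment]
    return loads, assignment


# Hypothetical estimated costs (arbitrary units) for a heterogeneous pool.
costs = {
    "knn_k50": 9.0, "lof_k20": 8.0, "knn_k5": 7.0,
    "iforest_200": 2.0, "hbos_50": 1.0, "hbos_10": 1.0,
}

naive_loads, _ = naive_split(costs, 2)
lpt_loads, lpt_assignment = balance_tasks(costs, 2)
print("naive slowest worker:", max(naive_loads))
print("balanced slowest worker:", max(lpt_loads))
```

Under these made-up costs, splitting by model count places all three expensive neighbor-based models on one worker, so the whole ensemble waits on that worker. Assigning by estimated cost spreads the load more evenly across the two workers.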
