Enhancing Robustness of On-line Learning Models on Highly Noisy Data

Classification algorithms have been widely adopted to detect anomalies in various systems, e.g., IoT, cloud, and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected in the wild can be unreliable due to careless annotation or malicious data transformation intended to cause incorrect anomaly detection. In this paper, we extend a two-layer on-line data selection framework, Robust Anomaly Detector (RAD), with a newly designed ensemble prediction in which both layers contribute to the final anomaly detection decision. To adapt to the on-line nature of anomaly detection, we consider three additional features: conflicting opinions of classifiers, repetitive cleaning, and oracle knowledge. We learn on-line from incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity offered by the larger accumulated data set. Moreover, we explore the concept of oracle learning, which provides additional information on the true labels of difficult data points. We focus on three use cases: (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, and (iii) recognising the faces of 100 celebrities. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, reaching up to 98.95% for IoT device attacks (i.e., +7%) and up to 85.03% for cloud task failures (i.e., +14%) under 40% label noise, while its extension reaches up to 77.51% for face recognition (i.e., +39%) under 30% label noise. The proposed RAD and its extensions are general and can be applied to different anomaly detection algorithms.
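
To make the two-layer idea concrete, the following is a minimal sketch of on-line data selection under label noise. It is not the authors' implementation: the nearest-centroid model, the names `CentroidClassifier`, `cleanse_and_learn`, `make_batch` and `demo`, the synthetic two-blob data, and the optional `oracle` callable are all assumptions introduced for illustration. Both layers use the same toy model here for brevity, and the ensemble prediction and repetitive cleaning are omitted.

```python
import random


class CentroidClassifier:
    """Tiny nearest-centroid model, a stand-in for any real classifier."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sum
        self.counts = {}  # label -> number of samples seen

    def partial_fit(self, X, y):
        for x, label in zip(X, y):
            if label not in self.sums:
                self.sums[label] = [0.0] * len(x)
                self.counts[label] = 0
            self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
            self.counts[label] += 1

    def predict(self, x):
        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))
        return min(self.sums, key=sq_dist)


def cleanse_and_learn(quality, detector, X, y, oracle=None):
    """Process one batch; return how many samples were kept for training."""
    kept_X, kept_y = [], []
    for x, label in zip(X, y):
        if quality.predict(x) == label:
            kept_X.append(x)          # quality model agrees with the given label
            kept_y.append(label)
        elif oracle is not None:
            kept_X.append(x)          # conflict: ask the (hypothetical) oracle
            kept_y.append(oracle(x))
        # otherwise the sample is treated as suspicious and dropped
    quality.partial_fit(kept_X, kept_y)   # layer 1: label quality model
    detector.partial_fit(kept_X, kept_y)  # layer 2: anomaly classifier
    return len(kept_X)


def make_batch(rng, n, noise_rate):
    """Two Gaussian blobs; a fraction `noise_rate` of labels is flipped."""
    X, y_true, y_noisy = [], [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 5.0 * label
        X.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
        y_true.append(label)
        y_noisy.append(1 - label if rng.random() < noise_rate else label)
    return X, y_true, y_noisy


def demo(seed=0, noise_rate=0.4):
    """Return (test accuracy, fraction of streamed samples kept)."""
    rng = random.Random(seed)
    quality, detector = CentroidClassifier(), CentroidClassifier()
    X0, y0, _ = make_batch(rng, 20, 0.0)       # small clean bootstrap set
    quality.partial_fit(X0, y0)
    detector.partial_fit(X0, y0)
    kept = seen = 0
    for _ in range(10):                        # simulated data stream
        X, _, y_noisy = make_batch(rng, 50, noise_rate)
        kept += cleanse_and_learn(quality, detector, X, y_noisy)
        seen += len(X)
    X_test, y_test, _ = make_batch(rng, 200, 0.0)
    hits = sum(detector.predict(x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test), kept / seen
```

Calling `demo()` streams ten noisy batches through the filter and reports the detector's accuracy on clean test data together with the share of samples that survived cleansing. Passing an `oracle` callable to `cleanse_and_learn` relabels conflicting samples instead of dropping them, which loosely mirrors the oracle knowledge described in the abstract.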
