Analyzing the distribution shift of data is a growing research direction in today's machine learning, giving rise to new benchmarks that provide suitable scenarios for studying the generalization properties of ML models. Existing benchmarks focus on supervised learning and, to the best of our knowledge, there is none for unsupervised learning. We therefore introduce an unsupervised anomaly detection benchmark with data that shifts over time, built on top of Kyoto-2006+, a traffic dataset for network intrusion detection. This kind of data meets the premise of a shifting input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (\eg users modifying their behavior patterns, software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol that splits the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, from masked language models (MLM) to the classical Isolation Forest. Finally, we show that by acknowledging the distribution shift problem and properly addressing it, performance can be improved compared to classical IID training (by up to $3\%$, on average). Dataset and code are available at https://github.com/bit-ml/AnoShift/.
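To make the protocol concrete, below is a minimal sketch of an AnoShift-style temporal evaluation: samples are bucketed by year into TRAIN (IID), NEAR, and FAR subsets, and a 1-D Wasserstein (Optimal Transport) distance between two years' values of a single feature serves as a simple proxy for distribution shift. The split boundaries, the `temporal_splits` and `w1` helpers, and the synthetic feature values are illustrative assumptions, not the official AnoShift splits or measurement code.

```python
import numpy as np

def temporal_splits(years, train_range, near_range, far_range):
    """Return boolean masks selecting TRAIN (IID), NEAR, and FAR samples
    by year. Ranges are inclusive (lo, hi) tuples; boundaries here are
    illustrative, not the paper's exact protocol."""
    years = np.asarray(years)
    in_range = lambda lo, hi: (years >= lo) & (years <= hi)
    return {
        "train": in_range(*train_range),
        "near": in_range(*near_range),
        "far": in_range(*far_range),
    }

def w1(u, v):
    """Wasserstein-1 distance between two equal-size empirical samples:
    mean absolute difference of the sorted values."""
    return np.abs(np.sort(u) - np.sort(v)).mean()

# Toy data: one sample per year over 2006..2015 (Kyoto-2006+ spans ~10 years).
years = np.arange(2006, 2016)
splits = temporal_splits(years, (2006, 2010), (2011, 2013), (2014, 2015))
print(splits["train"].sum(), splits["near"].sum(), splits["far"].sum())
# -> 5 3 2

# A feature whose mean drifts by 0.5 between two years has W1 exactly 0.5.
rng = np.random.default_rng(0)
feat_early = rng.normal(0.0, 1.0, 1000)
feat_late = feat_early + 0.5  # simulated drift
print(w1(feat_early, feat_late))
# -> 0.5 (larger distance => stronger shift between years)
```

In the full benchmark the same idea is applied per feature and across all pairs of years, so that test splits farther from the training years exhibit measurably larger distribution distances.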