Analyzing distribution shift in data is a growing research direction in modern Machine Learning (ML), and it has led to new benchmarks that provide suitable scenarios for studying the generalization properties of ML models. Existing benchmarks focus on supervised learning and, to the best of our knowledge, none targets unsupervised learning. We therefore introduce an unsupervised anomaly detection benchmark with data that shifts over time, built on Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol that splits the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning. Finally, we show that acknowledging the distribution shift problem and properly addressing it improves performance compared to classical training, which assumes independent and identically distributed data (on average, by up to $3\%$ for our approach). Dataset and code are available at https://github.com/bit-ml/AnoShift/.
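To make the idea of a time-based protocol concrete, the following is a minimal sketch of how records could be grouped into train, IID, NEAR, and FAR sets by year. It is not the released AnoShift code: the year boundaries, the `iid_every` hold-out rule, and the function names are all hypothetical, chosen only for illustration, since the abstract does not specify where the splits fall.

```python
from collections import defaultdict

# Hypothetical year boundaries, chosen only for illustration; the abstract
# does not state where the splits fall.
TRAIN_YEARS = range(2006, 2011)
NEAR_YEARS = range(2011, 2014)
FAR_YEARS = range(2014, 2016)


def assign_split(year, index, iid_every=10):
    """Return the split name for a record from `year` at position `index`."""
    if year in TRAIN_YEARS:
        # Hold out every `iid_every`-th record of the training period, so the
        # IID test set comes from the same years as the training data.
        return "iid_test" if index % iid_every == 0 else "train"
    if year in NEAR_YEARS:
        return "near_test"
    if year in FAR_YEARS:
        return "far_test"
    return "unused"


def split_by_year(records, iid_every=10):
    """Group (year, features) records into train / IID / NEAR / FAR buckets."""
    splits = defaultdict(list)
    for index, (year, features) in enumerate(records):
        splits[assign_split(year, index, iid_every)].append((year, features))
    return dict(splits)
```

Under these assumptions, a model would be fit on `train` only and then scored separately on `iid_test`, `near_test`, and `far_test`, so that any drop in performance can be read against the temporal distance from the training period.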