Analyzing the distribution shift of data is a growing research direction in today's Machine Learning (ML), leading to new benchmarks that provide suitable scenarios for studying the generalization properties of ML models. Existing benchmarks focus on supervised learning, and to the best of our knowledge, there is none for unsupervised learning. Therefore, we introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of shifting the input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol that splits the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning. Finally, we show that by acknowledging the distribution shift problem and properly addressing it, the performance can be improved compared to classical training, which assumes independent and identically distributed data (on average, by up to $3\%$ for our approach). Dataset and code are available at https://github.com/bit-ml/AnoShift/.
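To make the Optimal Transport step concrete, below is a minimal sketch (not the authors' exact pipeline) of how one could measure the distribution distance between the feature sets of two years using the POT library; the function name `ot_distance` and the Gaussian toy data are illustrative assumptions, not part of the benchmark:

```python
# Hedged sketch: exact Optimal Transport cost between two empirical
# distributions (e.g., traffic features from two different years).
# Requires the POT library: pip install pot
import numpy as np
import ot  # Python Optimal Transport


def ot_distance(X_year_a: np.ndarray, X_year_b: np.ndarray) -> float:
    """Exact OT cost between two sample sets of shape (n_samples, n_features)."""
    n, m = len(X_year_a), len(X_year_b)
    # Uniform weights over the samples of each year
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    # Pairwise Euclidean ground cost between samples
    M = ot.dist(X_year_a, X_year_b, metric="euclidean")
    # Solve the exact OT problem and return the transport cost
    return ot.emd2(a, b, M)


# Toy usage: two synthetic "years" whose feature means drift apart
rng = np.random.default_rng(0)
X_early = rng.normal(0.0, 1.0, size=(500, 8))
X_late = rng.normal(1.5, 1.0, size=(500, 8))
print(ot_distance(X_early, X_late))  # larger value => stronger shift
```

Computed between every pair of years, such distances yield a matrix whose growth away from the diagonal reflects the non-stationarity the abstract describes.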