Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target being predicted changes over time. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift, explaining the features and time intervals that contribute the most to it, and mitigating it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.
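The detect, explain, and mitigate steps named in the abstract can be illustrated with a minimal sketch. This is not the LEAF implementation: the abstract does not specify the detector, the explanation method, or the resampling rule, so everything below is an assumption made for illustration. The function names (`detect_drift`, `rank_high_error`, `mitigate`), the mean-error ratio test, and the parameters `ratio`, `forget_fraction`, `top_k`, and `copies` are all hypothetical. Ranking samples by error is only a crude stand-in for explaining which features and time intervals contribute to drift.

```python
from statistics import mean


def detect_drift(errors, ref_size, recent_size, ratio=1.5):
    """Flag drift when the mean error of the most recent window exceeds
    `ratio` times the mean error of the earliest (reference) window.
    Assumed rule for illustration; not the detector used by LEAF."""
    if len(errors) < ref_size + recent_size:
        return False
    reference = mean(errors[:ref_size])
    recent = mean(errors[-recent_size:])
    return recent > ratio * reference


def rank_high_error(errors, top_k):
    """Return, in time order, the indices of the `top_k` largest errors.
    A simplified stand-in for the 'explain' step."""
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return sorted(ranked[:top_k])


def mitigate(samples, errors, forget_fraction=0.25, top_k=2, copies=2):
    """Build a new training set from time-ordered samples.
    Forgetting: drop the oldest `forget_fraction` of the samples.
    Over-sampling: append `copies` extra copies of the `top_k`
    highest-error samples among those that remain."""
    start = int(len(samples) * forget_fraction)
    kept = samples[start:]
    kept_errors = errors[start:]
    hard = rank_high_error(kept_errors, top_k)
    extra = [kept[i] for i in hard for _ in range(copies)]
    return kept + extra


# Hypothetical per-sample prediction errors, oldest first.
errors = [1.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 5.0]
samples = list(range(len(errors)))  # sample identifiers, oldest first

if detect_drift(errors, ref_size=4, recent_size=4):
    training_set = mitigate(samples, errors)
else:
    training_set = samples
```

In this toy run the recent window has a mean error of 3.25 against a reference of 1.0, so drift is flagged, the two oldest samples are forgotten, and the two highest-error remaining samples are each added twice more. Retraining only when the detector fires, rather than on a fixed schedule, is what would reduce the number of retraining operations in a setup like this.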