Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Unfortunately, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target prediction changes for reasons ranging from software upgrades to seasonality to changes in user behavior. Mitigating concept drift is thus an essential part of operationalizing machine learning models, yet despite its importance, concept drift has not been extensively explored in the context of networking -- or regression models in general. As a result, it is not well understood how to detect or mitigate drift for many common network management tasks that currently rely on machine learning models. As we show, concept drift cannot be sufficiently mitigated by frequently retraining models on newly available data; doing so can even degrade model accuracy further. In this paper, we characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. To do so, we develop Local Error Approximation of Features (LEAF). LEAF detects drift; explains the features and time intervals that most contribute to drift; and mitigates drift using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF is effective on a variety of KPIs and models. LEAF consistently outperforms periodic and triggered retraining while reducing costly retraining operations.
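The ideas the abstract names -- detecting drift from rising prediction error, then mitigating it by retraining with forgetting so stale samples carry less weight -- can be illustrated with a minimal sketch. This is not the LEAF algorithm itself; it is a toy example on synthetic data, where the drift threshold, window size, and exponential forgetting schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic KPI stream: the feature/target relationship flips mid-stream,
# a stand-in for concept drift after, e.g., a software upgrade.
n = 400
x = rng.normal(size=n)
y = np.where(np.arange(n) < 200, 2.0 * x, -2.0 * x) + 0.1 * rng.normal(size=n)

def fit_weighted_slope(x, y, half_life=50):
    """Weighted least squares with exponential forgetting of old samples."""
    ages = np.arange(len(y))[::-1]        # newest sample has age 0
    w = 0.5 ** (ages / half_life)         # forgetting weights
    return np.sum(w * x * y) / np.sum(w * x * x)

# Train on the first half, then monitor rolling error on the stream.
slope = fit_weighted_slope(x[:200], y[:200])
errors = np.abs(y - slope * x)
baseline = errors[:200].mean()            # in-training error level
window = 50

drift_at = None
for t in range(200, n):
    # Flag drift when the rolling mean error far exceeds the baseline.
    if errors[t - window:t].mean() > 3 * baseline:
        drift_at = t
        break

# Mitigate: retrain on data up to the detection point, with a shorter
# half-life so pre-drift samples are largely forgotten.
new_slope = fit_weighted_slope(x[:drift_at + window],
                               y[:drift_at + window], half_life=15)
```

Before the flip the model recovers a slope near +2; shortly after the flip the rolling error crosses the threshold, and retraining with aggressive forgetting pulls the slope negative, tracking the new concept. Over-sampling recent post-drift points would play a similar role to shrinking the half-life here.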