Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Unfortunately, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target prediction changes over time for reasons ranging from software upgrades to seasonality to changes in user behavior. Mitigating concept drift is thus an essential part of operationalizing machine learning models. Yet, despite its importance, concept drift has not been extensively explored in the context of networking -- or for regression models in general. As a result, it is not well understood how to detect or mitigate it for many common network management tasks that currently rely on machine learning models. As we show, concept drift cannot always be mitigated by periodically retraining models on newly available data, and doing so can even degrade model accuracy. In this paper, we characterize concept drift in a large cellular network serving a metropolitan area in the United States. We find that concept drift occurs across key performance indicators (KPIs), regardless of model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. To this end, we develop Local Error Approximation of Features (LEAF). LEAF detects drift; explains the features and time intervals that contribute most to drift; and mitigates drift using resampling, augmentation, or ensembling. We evaluate LEAF against industry-standard mitigations (i.e., periodic retraining) on more than three years of cellular data from Verizon. LEAF consistently outperforms periodic retraining across a variety of KPIs and models, while reducing costly retrains by an order of magnitude. Due to its effectiveness, a major cellular carrier is now integrating LEAF into its forecasting and provisioning processes.