Monitoring the behavior of automated real-time stream processing systems has become one of the most relevant problems in real-world applications. Such systems have grown in complexity, relying heavily on high-dimensional input data and data-hungry Machine Learning (ML) algorithms. We propose a flexible system, Feature Monitoring (FM), that detects data drift in these high-dimensional input streams with a small, constant memory footprint and a low computational cost. The method is based on a multivariate statistical test and is data-driven by design: full reference distributions are estimated from the data. It monitors all features used by the system and provides an interpretable feature ranking whenever an alarm occurs, to aid root cause analysis. The system's low computational and memory cost results from the use of Exponential Moving Histograms. In our experimental study, we analyze how the system's behavior depends on its parameters and, more importantly, show examples where it detects problems that are not directly related to a single feature. This illustrates that FM eliminates the need to add custom signals to detect specific types of problems, and that monitoring the available feature space is often enough.
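To make the constant-memory claim concrete, the following is a minimal sketch of an exponentially decayed histogram for a single feature, compared against a reference distribution. It is an illustration under stated assumptions, not the FM implementation: the class name `ExponentialMovingHistogram`, the `half_life` parameter, the fixed bin edges, and the use of the Jensen-Shannon divergence as the drift score are all choices made here, since the abstract does not specify them.

```python
import bisect
import math
import random


class ExponentialMovingHistogram:
    """Hypothetical fixed-bin histogram whose counts decay exponentially.

    Memory is constant: one float per bin, regardless of stream length.
    """

    def __init__(self, bin_edges, half_life=100.0):
        self.bin_edges = list(bin_edges)
        # Per-event decay factor chosen so a count halves after `half_life` events.
        self.decay = 0.5 ** (1.0 / half_life)
        self.counts = [0.0] * (len(self.bin_edges) - 1)

    def update(self, value):
        # Age every bin, then add one unit of mass to the bin holding `value`.
        self.counts = [c * self.decay for c in self.counts]
        idx = bisect.bisect_right(self.bin_edges, value) - 1
        # Out-of-range values are clipped into the first or last bin.
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1.0

    def distribution(self):
        total = sum(self.counts)
        n = len(self.counts)
        if total == 0.0:
            return [1.0 / n] * n  # no data yet: fall back to uniform
        return [c / total for c in self.counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    m = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    rng = random.Random(0)
    edges = [-5.0 + 0.5 * i for i in range(21)]

    # Reference distribution estimated from data (very long half-life: almost no decay).
    reference = ExponentialMovingHistogram(edges, half_life=1e6)
    for _ in range(2000):
        reference.update(rng.gauss(0.0, 1.0))

    # Live histogram that tracks only the recent past of the stream.
    live = ExponentialMovingHistogram(edges, half_life=50.0)
    for _ in range(500):
        live.update(rng.gauss(2.0, 1.0))  # simulated drift: shifted mean

    score = js_divergence(reference.distribution(), live.distribution())
    print(f"drift score for this feature: {score:.3f}")
```

In a multi-feature setting, one such histogram per feature would yield one score per feature, and sorting those scores is one plausible way to obtain a feature ranking when an alarm fires. How FM actually combines the per-feature signals into its multivariate test is not described in the abstract.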