Detecting drift in data is essential for machine learning applications, because changes in the statistics of the processed data typically have a profound influence on the performance of trained models. Most available drift detection methods are either supervised and require access to the true labels at inference time, or completely unsupervised and look for changes in distribution without taking label information into account. We propose a novel task-sensitive, semi-supervised drift detection scheme that uses label information while training the initial model but takes into account that labels are no longer available when the model is used for inference. The scheme relies on a constrained low-dimensional embedding of the input data, which tailors the representation to the classification task. As a result, it detects real drift, where the drift affects classification performance, while properly ignoring virtual drift, where classification performance is unaffected. Within the proposed framework, the actual method used to detect a change in the statistics of incoming data samples can be chosen freely. Experimental evaluation on nine benchmark datasets with different types of drift demonstrates that the proposed framework reliably detects drift and outperforms state-of-the-art unsupervised drift detection approaches.
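The distinction between real and virtual drift can be illustrated with a toy sketch. It is not the proposed method: the constrained embedding is replaced here by a simple projection onto the difference of the class means (an assumption chosen only because it uses labels at training time), and the freely chosen detector is a two-sample Kolmogorov-Smirnov test with the standard large-sample critical value. All function names, the synthetic data, and the shift sizes are hypothetical.

```python
import math
import random


def fit_embedding(X, y):
    """Label-aware 1-D embedding: unit vector along the class-mean difference.

    Stand-in for a learned, task-constrained embedding (assumption).
    """
    dim = len(X[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, label in zip(X, y):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += x[d]
    diff = [sums[1][d] / counts[1] - sums[0][d] / counts[0] for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]


def embed(w, X):
    """Project every sample onto the embedding direction w (no labels needed)."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Step both CDFs past every value equal to v so ties are handled.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_detected(reference, window, alpha=0.001):
    """Flag drift when the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(window)
    critical = math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, window) > critical


def make_data(rng, n, shift=(0.0, 0.0)):
    """Two Gaussian classes separated along feature 0; feature 1 is irrelevant."""
    X, y = [], []
    for k in range(n):
        label = k % 2
        center = 2.0 if label else -2.0
        X.append((rng.gauss(center, 1.0) + shift[0], rng.gauss(0.0, 1.0) + shift[1]))
        y.append(label)
    return X, y


def demo(seed=0):
    rng = random.Random(seed)
    X_train, y_train = make_data(rng, 2000)
    w = fit_embedding(X_train, y_train)      # labels are used only here
    reference = embed(w, X_train)
    # Virtual drift: only the task-irrelevant feature moves.
    X_virtual, _ = make_data(rng, 1000, shift=(0.0, 2.0))
    # Real drift: samples move across the decision boundary.
    X_real, _ = make_data(rng, 1000, shift=(1.5, 0.0))
    return (drift_detected(reference, embed(w, X_virtual)),
            drift_detected(reference, embed(w, X_real)))


if __name__ == "__main__":
    virtual_flag, real_flag = demo()
    print("virtual drift flagged:", virtual_flag)
    print("real drift flagged:", real_flag)
```

Because the detector only sees the embedded samples, a shift in the task-irrelevant feature is almost entirely projected away, whereas a shift along the discriminative direction changes the embedded distribution. Running the same test on the raw inputs would flag both shifts alike.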