Data stream classification is an important problem in machine learning. Because the data are non-stationary, with an underlying distribution that changes over time (concept drift), the model must continuously adapt to new data statistics. Stream-based Active Learning (AL) approaches address this problem by interactively querying a human expert for labels of the most recent samples, within a limited budget. Existing AL strategies assume that labels are immediately available, whereas in real-world scenarios the expert needs time to provide a queried label (verification latency), and by the time the requested labels arrive they may no longer be relevant. In this article, we investigate how finite, time-variable, and unknown verification latency affects AL approaches in the presence of concept drift. We propose PRopagate (PR), a latency-independent utility estimator that also predicts the requested, but not yet known, labels. Furthermore, we propose a drift-dependent dynamic budget strategy, which distributes the labeling budget variably over time after a detected drift. We conduct and analyze a thorough experimental evaluation on both synthetic and real-world non-stationary datasets, under different settings of verification latency and budget. We empirically show that the proposed method consistently outperforms the state of the art. Additionally, we demonstrate that variable budget allocation over time can boost the performance of AL strategies without increasing the overall labeling budget.
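To make the idea of a drift-dependent dynamic budget concrete, the following is a minimal sketch and not the authors' actual strategy, which the abstract does not specify. It assumes a simple two-level schedule: the per-step labeling budget is raised for a short period right after a detected drift and lowered for the rest of a fixed window, so that the average budget over the window equals the base budget. The function name `dynamic_budget` and the parameters `window`, `boost`, and `boost_fraction` are hypothetical and chosen only for illustration.

```python
def dynamic_budget(base_budget, steps_since_drift, window=1000,
                   boost=2.0, boost_fraction=0.25):
    """Illustrative per-step labeling budget after a detected drift.

    All parameters are assumptions made for this sketch:
      base_budget       -- long-run fraction of samples that may be queried
      steps_since_drift -- samples seen since the last detected drift,
                           or None if no drift has been detected yet
      window            -- number of steps over which the budget is reshaped
      boost             -- multiplier applied right after the drift
      boost_fraction    -- share of the window spent at the boosted rate
    """
    if boost * boost_fraction > 1.0:
        raise ValueError("boost * boost_fraction must not exceed 1")
    if base_budget * boost > 1.0:
        raise ValueError("boosted budget must not exceed 1")

    # Outside the post-drift window, fall back to the base budget.
    if steps_since_drift is None or steps_since_drift >= window:
        return base_budget

    boost_len = int(window * boost_fraction)
    if steps_since_drift < boost_len:
        # Spend more labels right after the drift, when the model is stale.
        return base_budget * boost

    # Spend fewer labels for the rest of the window, so the total over the
    # window is base_budget * window (the overall budget is unchanged).
    return base_budget * (window - boost * boost_len) / (window - boost_len)


# Usage: the total budget over one window matches the constant-budget total.
total = sum(dynamic_budget(0.1, t) for t in range(1000))
constant_total = 0.1 * 1000
```

With the default values, the budget is doubled for the first 250 steps after a drift and reduced to two thirds of the base rate for the remaining 750 steps, which is one simple way to redistribute labels over time without increasing the overall labeling budget.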