Mining data streams poses a number of challenges, including the continuous and non-stationary nature of the data, the massive volume of information to be processed, and constraints on computational resources. While a number of supervised solutions to this problem have been proposed in the literature, most of them assume that access to the ground truth (in the form of class labels) is unlimited and that such information can be used instantly when updating the learning system. This is far from realistic, as one must consider the underlying cost of acquiring labels. Solutions that reduce the ground-truth requirements in streaming scenarios are therefore needed. In this paper, we propose a novel framework for mining drifting data streams on a budget by combining active learning and self-labeling. We introduce several strategies that take advantage of both intelligent instance selection and semi-supervised procedures while taking into account the potential presence of concept drift. Such a hybrid approach allows for efficient exploration and exploitation of streaming data structures within realistic labeling budgets. Since our framework works as a wrapper, it can be applied with different learning algorithms. An experimental study, carried out on a diverse set of real-world data streams with various types of concept drift, demonstrates the usefulness of the proposed strategies when access to class labels is highly limited. The presented hybrid approach is especially suitable when one cannot increase the labeling budget or replace an inefficient classifier. We also provide a set of recommendations regarding the areas of applicability of our strategies.
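To make the idea of combining active learning with self-labeling under a budget more concrete, the following is a minimal illustrative sketch and not the strategies proposed in the paper. Everything in it is an assumption made for illustration: the toy `CentroidClassifier`, the function `hybrid_stream_learning`, the confidence thresholds, and the rule that the budget is the fraction of seen instances sent to the oracle. It also omits any concept drift handling, which the actual framework takes into account.

```python
import math
import random


class CentroidClassifier:
    """Hypothetical toy incremental learner: one running-mean centroid per class."""

    def __init__(self):
        self.centroids = {}  # label -> (count, mean vector)

    def learn(self, x, y):
        # Update the running mean of class y with instance x.
        count, mean = self.centroids.get(y, (0, [0.0] * len(x)))
        count += 1
        mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
        self.centroids[y] = (count, mean)

    def predict_proba(self, x):
        # Softmax over negative distances, shifted by the nearest
        # distance so the exponentials cannot all underflow to zero.
        if not self.centroids:
            return {}
        dists = {y: math.dist(x, mean) for y, (_, mean) in self.centroids.items()}
        nearest = min(dists.values())
        scores = {y: math.exp(-(d - nearest)) for y, d in dists.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}


def hybrid_stream_learning(stream, oracle, model, budget=0.1,
                           query_threshold=0.6, self_label_threshold=0.9):
    """Process a stream once, mixing oracle queries and self-labeling.

    budget is the assumed maximum fraction of seen instances that may be
    sent to the oracle. Returns (labels_spent, self_labeled).
    """
    labels_spent = 0
    self_labeled = 0
    for seen, x in enumerate(stream, start=1):
        proba = model.predict_proba(x)
        if len(proba) < 2:
            # With fewer than two known classes the confidence is meaningless,
            # so treat the instance as maximally uncertain.
            predicted, confidence = None, 0.0
        else:
            predicted, confidence = max(proba.items(), key=lambda kv: kv[1])

        within_budget = labels_spent / seen < budget
        if confidence < query_threshold and within_budget:
            # Active learning step: pay for a true label on an uncertain instance.
            model.learn(x, oracle(x))
            labels_spent += 1
        elif confidence >= self_label_threshold:
            # Self-labeling step: trust a confident prediction at no labeling cost.
            model.learn(x, predicted)
            self_labeled += 1
        # Otherwise the instance is discarded without updating the model.
    return labels_spent, self_labeled


if __name__ == "__main__":
    random.seed(0)
    # Synthetic two-class stream: Gaussian blobs around (-2, -2) and (2, 2).
    centers = [random.choice((-2.0, 2.0)) for _ in range(1000)]
    stream = [[random.gauss(c, 0.5), random.gauss(c, 0.5)] for c in centers]
    spent, auto = hybrid_stream_learning(
        stream, oracle=lambda x: int(x[0] + x[1] > 0), model=CentroidClassifier())
    print("oracle labels used:", spent, "self-labeled:", auto)
```

Because the loop only calls `learn` and `predict_proba`, any incremental classifier exposing those two methods could be substituted for the toy learner, which mirrors the wrapper design described above. A drift-aware variant would additionally need some way to stop trusting self-labels after a change, which this sketch does not attempt.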