The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory, such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria that are often not clearly stated, for problem settings that are not clearly defined; they are tested in unrealistic settings and/or in isolation from related approaches in the wider literature. This calls into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to such tasks. Drawing on this formulation and overview, and helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that any constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.