Often, machine learning applications must cope with dynamic environments in which data are collected as continuous data streams of potentially infinite length and transient behavior. Compared to traditional (batch) data mining, stream-processing algorithms have additional requirements regarding computational resources and adaptability to data evolution. They must process instances incrementally, because the continuous flow of data prohibits storing it for multiple passes. Ensemble learning has achieved remarkable predictive performance in this scenario. Implemented as a set of (several) individual classifiers, ensembles are naturally amenable to task parallelism. However, the incremental learning and the dynamic data structures used to capture concept drift increase cache misses and hinder the benefits of parallelism. This paper proposes a mini-batching strategy that can improve memory access locality and the performance of several ensemble algorithms for stream mining in multi-core environments. With the aid of a formal framework, we demonstrate that mini-batching can significantly decrease the reuse distance (and hence the number of cache misses). Experiments with six different state-of-the-art ensemble algorithms on four benchmark datasets with varied characteristics show speedups of up to 5X on 8-core processors. These benefits come at the expense of a small reduction in predictive performance.
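To make the idea concrete, the following is a minimal sketch of mini-batched, test-then-train ensemble processing, not the paper's implementation: the `IncrementalClassifier` interface, the `MiniBatchEnsemble` class, and all method names are hypothetical stand-ins for a real stream-mining library's API. The key point is the loop order in `processBatch`: iterating over ensemble members in the outer loop keeps each member's model cache-resident while it consumes the whole batch, shortening the reuse distance compared to instance-at-a-time processing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical incremental classifier; not a real stream-mining library API. */
interface IncrementalClassifier {
    double predict(double[] instance);            // test-then-train: predict first...
    void train(double[] instance, double label);  // ...then update the model
}

/** Minimal sketch of mini-batched ensemble processing for better cache locality. */
class MiniBatchEnsemble {
    private final List<IncrementalClassifier> members;
    private final int batchSize;
    private final List<double[]> bufferX = new ArrayList<>();
    private final List<Double> bufferY = new ArrayList<>();

    MiniBatchEnsemble(List<IncrementalClassifier> members, int batchSize) {
        this.members = members;
        this.batchSize = batchSize;
    }

    /** Buffer the incoming instance; process the whole mini-batch once it is full. */
    void onInstance(double[] x, double y) {
        bufferX.add(x);
        bufferY.add(y);
        if (bufferX.size() == batchSize) {
            processBatch();
            bufferX.clear();
            bufferY.clear();
        }
    }

    /**
     * Members in the outer loop, instances in the inner loop: each member's
     * model is touched batchSize times in a row, so its data tends to stay in
     * cache. The outer loop is also a natural unit for task parallelism
     * (one task per ensemble member).
     */
    private void processBatch() {
        for (IncrementalClassifier member : members) {
            for (int i = 0; i < bufferX.size(); i++) {
                member.predict(bufferX.get(i));              // votes would be aggregated here
                member.train(bufferX.get(i), bufferY.get(i));
            }
        }
    }
}
```

The trade-off hinted at in the abstract follows from this buffering: predictions for instances inside a mini-batch are made before the models are updated on those instances, which can slightly reduce predictive performance under concept drift.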