Decision forests, including random forests and gradient boosting trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, current standard implementations operate only in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this limitation. Nonetheless, we found that those state-of-the-art algorithms suffer from a number of drawbacks, including poor performance on some problems and high memory usage on others. We therefore developed the simplest possible extension of decision trees we could think of: given new data, simply update existing trees by continuing to grow them, and replace some old trees with new ones to control the total number of trees. On three standard datasets, we illustrate that our approach, Stream Decision Forest (SDF), does not suffer from either of the aforementioned limitations. On a benchmark suite of 72 classification problems (the OpenML-CC18 data suite), we illustrate that our approach often performs as well as, and sometimes even better than, the batch-mode decision forest algorithm. Thus, SDFs establish a simple standard for streaming trees and forests that could readily be applied to many real-world problems, including those with distribution drift and continual learning.
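The update rule described above — grow existing trees on each new batch and periodically replace old trees to cap the ensemble size — can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the class names (`StreamTree`, `StreamForest`), the leaf-splitting criterion (split a leaf once it is large and impure enough, using exhaustive Gini search), and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def _gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def _best_split(X, y):
    """Exhaustive axis-aligned split minimizing weighted Gini impurity.
    Returns (feature, threshold), or (None, None) if no split improves."""
    n = len(y)
    best_j, best_t, best_score = None, None, _gini(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:  # midpoints between unique values
            mask = X[:, j] <= t
            score = (mask.sum() * _gini(y[mask]) +
                     (~mask).sum() * _gini(y[~mask])) / n
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t

class _Leaf:
    """Leaf node: keeps its samples so it can split when more data arrive."""
    def __init__(self, X, y):
        self.X, self.y = X, y

class _Split:
    """Internal node: axis-aligned test with two children."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

class StreamTree:
    """A tree grown incrementally: new samples are routed to existing
    leaves, and a leaf splits once it is large and impure enough."""
    def __init__(self, min_leaf=5):
        self.min_leaf = min_leaf
        self.root = None

    def partial_fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        if self.root is None:
            self.root = _Leaf(X.copy(), y.copy())
        else:
            for x, label in zip(X, y):  # route each new sample to its leaf
                node = self.root
                while isinstance(node, _Split):
                    node = node.left if x[node.feature] <= node.threshold else node.right
                node.X = np.vstack([node.X, x])
                node.y = np.append(node.y, label)
        self.root = self._grow(self.root)  # continue growing from the leaves
        return self

    def _grow(self, node):
        if isinstance(node, _Split):
            node.left, node.right = self._grow(node.left), self._grow(node.right)
            return node
        if len(node.y) < 2 * self.min_leaf or _gini(node.y) == 0.0:
            return node  # too small or already pure
        j, t = _best_split(node.X, node.y)
        if j is None:
            return node
        mask = node.X[:, j] <= t
        return _Split(j, t,
                      self._grow(_Leaf(node.X[mask], node.y[mask])),
                      self._grow(_Leaf(node.X[~mask], node.y[~mask])))

    def predict(self, X):
        out = []
        for x in np.asarray(X, dtype=float):
            node = self.root
            while isinstance(node, _Split):
                node = node.left if x[node.feature] <= node.threshold else node.right
            vals, counts = np.unique(node.y, return_counts=True)
            out.append(vals[np.argmax(counts)])  # leaf majority class
        return np.array(out)

class StreamForest:
    """Forest-level sketch: grow every tree on each batch (bootstrapped),
    and every `replace_every` batches drop the oldest tree and add a
    fresh one trained on the newest batch, keeping the ensemble size fixed."""
    def __init__(self, n_trees=10, replace_every=None, min_leaf=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.trees = [StreamTree(min_leaf) for _ in range(n_trees)]
        self.replace_every, self.min_leaf, self.n_batches = replace_every, min_leaf, 0

    def partial_fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.n_batches += 1
        for tree in self.trees:
            idx = self.rng.integers(0, len(y), len(y))  # bootstrap the batch
            tree.partial_fit(X[idx], y[idx])
        if self.replace_every and self.n_batches % self.replace_every == 0:
            self.trees.pop(0)  # retire the oldest tree
            self.trees.append(StreamTree(self.min_leaf).partial_fit(X, y))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Note the simplifying choices: leaves cache their samples (the memory/accuracy trade-off the abstract alludes to), prediction assumes non-negative integer labels, and the split search is exhaustive rather than the streaming statistics a production implementation would use.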