Decision forests, including random forests and gradient boosted trees, remain the leading machine learning methods for many real-world data problems, especially on tabular data. However, current standard implementations only operate in batch mode, and therefore cannot incrementally update when more data arrive. Several previous works developed streaming trees and ensembles to overcome this limitation. Nonetheless, we found that those state-of-the-art algorithms suffer from a number of drawbacks, including performing very poorly on some problems and requiring a huge amount of memory on others. We therefore developed the simplest possible extension of decision trees we could think of: given new data, simply update existing trees by continuing to grow them, and replace some old trees with new ones to control the total number of trees. On three standard datasets, we illustrate that our approach, Stream Decision Forest (SDF), does not suffer from either of the aforementioned limitations. In a benchmark suite containing 72 classification problems (the OpenML-CC18 data suite), we illustrate that our approach often performs as well as, and sometimes even better than, the batch mode random forest algorithm. Thus, we believe SDFs establish a simple standard for streaming trees and forests that could readily be applied to many real-world problems, including those with distribution drift and continual learning.
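The ensemble-management idea described above can be sketched as follows. This is a minimal illustration only: the `StubTree` class is a hypothetical placeholder learner, not the actual incrementally grown decision tree the paper describes, and the class names, parameters (`n_trees`, `replace_every`), and `partial_fit` API are assumptions chosen for the sketch.

```python
from collections import Counter

class StubTree:
    """Hypothetical placeholder for a decision tree that supports
    incremental growth; here it merely tracks class counts."""
    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y):
        # Stand-in for continuing to grow the tree on new data.
        self.counts.update(y)

    def predict(self, X):
        top = self.counts.most_common(1)[0][0]
        return [top] * len(X)

class StreamDecisionForest:
    """Sketch of the SDF idea: update every existing tree on each new
    batch, and periodically replace the oldest tree with a fresh one
    so the total number of trees stays fixed."""
    def __init__(self, n_trees=10, replace_every=3):
        self.trees = [StubTree() for _ in range(n_trees)]
        self.replace_every = replace_every
        self.batches_seen = 0

    def partial_fit(self, X, y):
        for tree in self.trees:
            tree.partial_fit(X, y)       # grow all existing trees
        self.batches_seen += 1
        if self.batches_seen % self.replace_every == 0:
            self.trees.pop(0)            # drop the oldest tree...
            fresh = StubTree()
            fresh.partial_fit(X, y)      # ...train a new one on recent data
            self.trees.append(fresh)     # ensemble size stays constant

    def predict(self, X):
        # Majority vote across the ensemble, per sample.
        votes = [tree.predict(X) for tree in self.trees]
        return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]
```

The key design point this sketch captures is that the forest never retrains from scratch: each batch only extends the current trees, and the replacement schedule bounds total memory by keeping the tree count constant.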