Machine learning has successfully leveraged modern data to provide computational solutions to innumerable real-world problems, including physical and biomedical discoveries. Current estimators can handle both the batch setting, where all samples are available at once, and the streaming setting, where the model must be updated continuously. However, there is still room for improvement in streaming algorithms based on decision trees and random forests, which are the leading methods in batch data tasks. In this paper, we explore the simplest partial fitting algorithm for extending batch trees and test our models, the stream decision tree (SDT) and the stream decision forest (SDF), on three classification tasks of varying complexity. For reference, existing streaming trees (Hoeffding trees and Mondrian forests) and batch estimators are included in the experiments. In all three tasks, SDF consistently produces high accuracy, whereas existing estimators encounter space constraints and accuracy fluctuations. Thus, our streaming trees and forests show great potential for further improvement and are good candidates for solving problems such as distribution drift and transfer learning.
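To make the idea of partially fitting a batch-style tree concrete, the following is a minimal sketch and not the SDT implementation evaluated in the paper. It assumes that each incoming sample is routed to a leaf, that leaves buffer their samples, and that a leaf is split on Gini impurity once it holds at least `min_samples_split` samples. The class `StreamTreeSketch`, its helpers, and the buffering strategy are all hypothetical choices made for illustration.

```python
from collections import Counter


class _Node:
    """Tree node; it is a leaf while `feature` is None."""

    def __init__(self):
        self.counts = Counter()  # class counts of samples routed here
        self.buffer = []         # (x, y) pairs kept while the node is a leaf
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None

    def is_leaf(self):
        return self.feature is None


def _gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0


class StreamTreeSketch:
    """Hypothetical illustration of partial fitting; not the authors' code."""

    def __init__(self, min_samples_split=4):
        self.min_samples_split = min_samples_split
        self.root = _Node()

    def _leaf_for(self, x):
        node = self.root
        while not node.is_leaf():
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node

    def partial_fit(self, X, y):
        """Route a new batch to the current leaves, then try to grow them."""
        touched = []
        for xi, yi in zip(X, y):
            leaf = self._leaf_for(xi)
            leaf.counts[yi] += 1
            leaf.buffer.append((xi, yi))
            touched.append(leaf)
        for leaf in touched:
            self._try_split(leaf)
        return self

    def _try_split(self, node):
        # Only impure leaves with enough buffered samples are candidates.
        if (not node.is_leaf() or len(node.counts) < 2
                or len(node.buffer) < self.min_samples_split):
            return
        n = len(node.buffer)
        best = None  # (weighted impurity, feature, threshold)
        for f in range(len(node.buffer[0][0])):
            values = sorted({x[f] for x, _ in node.buffer})
            for lo, hi in zip(values, values[1:]):
                t = (lo + hi) / 2.0
                left = Counter(y for x, y in node.buffer if x[f] <= t)
                right = Counter(y for x, y in node.buffer if x[f] > t)
                if not left or not right:
                    continue
                score = (sum(left.values()) * _gini(left)
                         + sum(right.values()) * _gini(right)) / n
                if best is None or score < best[0]:
                    best = (score, f, t)
        # Keep the leaf unless some split strictly reduces impurity.
        if best is None or best[0] >= _gini(node.counts):
            return
        _, node.feature, node.threshold = best
        node.left, node.right = _Node(), _Node()
        for x, y in node.buffer:
            child = node.left if x[node.feature] <= node.threshold else node.right
            child.counts[y] += 1
            child.buffer.append((x, y))
        node.buffer = []
        self._try_split(node.left)
        self._try_split(node.right)

    def predict(self, X):
        """Majority class of the leaf each sample lands in (needs prior data)."""
        return [self._leaf_for(x).counts.most_common(1)[0][0] for x in X]


tree = StreamTreeSketch(min_samples_split=4)
tree.partial_fit([[0.0], [1.0]], [0, 0])   # first batch: too few samples to split
tree.partial_fit([[2.0], [3.0]], [1, 1])   # second batch: the root can now split
predictions = tree.predict([[0.5], [2.5]])
```

In this sketch the tree is never rebuilt: earlier splits are kept and only leaves grow as batches arrive. Buffering samples at the leaves keeps the example short but lets memory grow with the stream, so it should be read as a simplification rather than a description of how SDT manages space.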