Gradient Boosted Decision Tree (GBDT) is a widely used machine learning algorithm that has been shown to achieve state-of-the-art results on many standard data science problems. We are interested in its application to multioutput problems when the output is highly multidimensional. Although there are highly effective GBDT implementations, their scalability to such problems is still unsatisfactory. In this paper, we propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario. The idea behind these methods lies in the approximate computation of a scoring function used to find the best split of decision trees. These methods are implemented in SketchBoost, which itself is integrated into our easily customizable Python-based GPU implementation of GBDT called Py-Boost. Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by up to 40 times while achieving comparable or even better performance.
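To illustrate the idea of approximating the split scoring function, the sketch below uses random projection of the per-sample gradient matrix, one plausible sketching strategy; the gain formula (squared norm of summed gradients per child, with unit hessians) and all names here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def split_gain(g, left_mask):
    # Second-order split gain with unit hessians:
    # ||sum g_left||^2 / n_left + ||sum g_right||^2 / n_right.
    # For d outputs, g has shape (n_samples, d), so the cost grows with d.
    gl, gr = g[left_mask], g[~left_mask]
    return (np.linalg.norm(gl.sum(axis=0)) ** 2 / len(gl)
            + np.linalg.norm(gr.sum(axis=0)) ** 2 / len(gr))

rng = np.random.default_rng(0)
n, d, k = 1000, 128, 32            # k << d: sketch dimension (assumed value)
g = rng.normal(size=(n, d))        # per-sample gradient vectors over d outputs

# Random-projection sketch: score splits in k dimensions instead of d.
# Scaling by 1/sqrt(k) keeps squared norms unbiased in expectation.
proj = rng.normal(size=(d, k)) / np.sqrt(k)
s = g @ proj

left = rng.random(n) < 0.5         # a candidate split of the samples
exact = split_gain(g, left)
approx = split_gain(s, left)
```

Because the gain depends on the gradients only through squared norms of their sums, a Johnson-Lindenstrauss-style projection preserves the relative ordering of candidate splits with high probability while cutting the per-split cost from O(d) to O(k).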