SHAP (SHapley Additive exPlanation) values provide a game-theoretic interpretation of the predictions of machine learning models based on Shapley values. While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap is available for decision tree models. However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles. We present GPUTreeShap, a modified TreeShap algorithm suitable for massively parallel computation on graphics processing units. Our approach first preprocesses each decision tree to isolate variable-sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second; equivalent CPU-based performance is estimated to require 6850 CPU cores.
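To illustrate the scheduling step described above, the following is a minimal sketch of a first-fit-decreasing bin packing heuristic in Python. The paper's actual implementation is CUDA C++ and considers more than one packing heuristic; the function name, the warp capacity of 32 threads, and the example sub-problem sizes below are ours for illustration only.

```python
# Illustrative sketch only: greedy first-fit-decreasing bin packing.
# In GPUTreeShap, variable-sized sub-problems derived from tree paths
# must be packed onto fixed-capacity 32-thread GPU warps; this shows the
# flavour of such a heuristic, not the paper's CUDA implementation.
def first_fit_decreasing(item_sizes, capacity=32):
    """Place each item in the first bin with room; open a new bin if none fits."""
    bins = []  # each bin is a list of item sizes summing to <= capacity
    for size in sorted(item_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:  # no existing bin could hold the item
            bins.append([size])
    return bins

# Hypothetical sub-problem sizes (e.g. path lengths rounded up to powers of two).
print(first_fit_decreasing([16, 8, 8, 4, 2, 2, 32, 1]))
# -> [[32], [16, 8, 8], [4, 2, 2, 1]]
```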
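As a hedged usage sketch, GPUTreeShap has been upstreamed into XGBoost's GPU predictor, so the SHAP values and SHAP interaction values benchmarked above can be requested from Python. The synthetic data and hyperparameters here are placeholders, and the snippet assumes an XGBoost 2.x build with CUDA support and an available NVIDIA GPU (older releases used tree_method="gpu_hist" with predictor="gpu_predictor" in place of the device parameter).

```python
import numpy as np
import xgboost as xgb

# Placeholder synthetic data; any tabular dataset works the same way.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20)).astype(np.float32)
y = (X[:, 0] + 0.5 * rng.standard_normal(10_000) > 0).astype(np.float32)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(
    {"objective": "binary:logistic", "device": "cuda"},  # train and predict on the GPU
    dtrain,
    num_boost_round=100,
)

# Per-feature SHAP values via GPUTreeShap: shape (n_rows, n_features + 1),
# where the final column is the bias (expected value) term.
shap_values = booster.predict(dtrain, pred_contribs=True)

# Pairwise SHAP interaction values: shape (n_rows, n_features + 1, n_features + 1).
shap_interactions = booster.predict(dtrain, pred_interactions=True)
```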