Decision forests, including RandomForest, XGBoost, and LightGBM, are among the most popular machine learning techniques in industrial scenarios such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks have been developed specifically for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled from data management frameworks, so it is unclear whether in-database inference would improve overall performance. In addition, these frameworks use different algorithms, optimization techniques, and parallelism models. It is unclear how these implementation choices affect overall performance and how to make design decisions for an in-database inference framework. In this work, we investigated these questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework that we implemented. Through this study, we found that netsDB is best suited for handling small-scale models on large-scale datasets and models of all scales on small-scale datasets, for which it achieved speedups of up to hundreds of times. In addition, the relation-centric representation we proposed significantly improved netsDB's performance in handling large-scale models, while the model reuse optimization we proposed further improved netsDB's performance in handling small-scale datasets.