Structured, or tabular, data is the most common data format in data science. While deep learning models have proven formidable at learning from unstructured data such as images or speech, they are less accurate than simpler approaches when learning from tabular data. In contrast, modern tree-based Machine Learning (ML) models shine at extracting relevant information from structured data. An essential requirement in data science is reducing model inference latency, for example when models are used in a closed loop with simulation to accelerate scientific discovery. However, the hardware-acceleration community has mostly focused on deep neural networks and largely ignored other forms of machine learning. Previous work has described the use of an analog content addressable memory (CAM) component for efficiently mapping random forests. In this work, we focus on an overall analog-digital architecture that implements a novel increased-precision analog CAM and a programmable network-on-chip, allowing the inference of state-of-the-art tree-based ML models such as XGBoost and CatBoost. Results evaluated on a single chip at 16 nm technology show 119x lower latency at 9740x higher throughput compared with a state-of-the-art GPU, with a 19 W peak power consumption.
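To make the CAM-based mapping mentioned above concrete, the following is a minimal conceptual sketch, not the paper's implementation: each root-to-leaf path of a decision tree is flattened into one row of per-feature intervals, and an analog CAM matches an input against all rows in parallel (emulated here with a loop). The example tree and its thresholds are hypothetical.

```python
import math

# Hypothetical decision tree over two features, x0 and x1:
#   if x0 < 3:   -> leaf "A"
#   elif x1 < 5: -> leaf "B"
#   else:        -> leaf "C"
# Each root-to-leaf path becomes one CAM row of [low, high) intervals,
# one interval per feature; unused features get (-inf, inf).
INF = math.inf
cam_rows = [
    # ((low0, high0), (low1, high1)), leaf label
    (((-INF, 3.0), (-INF, INF)), "A"),
    (((3.0, INF), (-INF, 5.0)), "B"),
    (((3.0, INF), (5.0, INF)), "C"),
]

def cam_match(x):
    """Return the leaf whose row matches x on every feature interval.

    In hardware all rows are compared in parallel in a single step;
    this software emulation simply scans the rows.
    """
    for intervals, leaf in cam_rows:
        if all(lo <= xi < hi for xi, (lo, hi) in zip(x, intervals)):
            return leaf
    return None

print(cam_match([2.0, 9.0]))  # leaf "A" (x0 < 3)
print(cam_match([4.0, 1.0]))  # leaf "B" (x0 >= 3, x1 < 5)
```

Because exactly one root-to-leaf path matches any input, the rows are mutually exclusive, which is what lets every tree of a forest be evaluated as one parallel lookup rather than a sequence of branch decisions.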