As artificial intelligence (AI) and machine learning (ML) technologies disrupt a wide range of industries, cloud datacenters face ever-increasing demand for inference workloads. However, conventional CPU-based servers cannot handle the excessive computational requirements of deep neural network (DNN) models, while GPU-based servers suffer from high power consumption and operating costs. In this paper, we present a scalable systolic-vector architecture that can cope with dynamically changing DNN workloads in cloud datacenters. We first devise a lightweight DNN model description format, called the unified model format (UMF), that enables general model representation and fast decoding in hardware accelerators. Based on this model format, we propose a heterogeneous architecture that features a load balancer for high-level workload distribution and multiple systolic-vector clusters, each consisting of a programmable scheduler, throughput-oriented systolic arrays, and function-oriented vector processors. We also propose a heterogeneity-aware scheduling algorithm that enables concurrent execution of multiple DNN workloads while maximizing heterogeneous hardware utilization based on estimated computation and memory access times. Finally, we build an architecture simulation framework based on actual synthesis and place-and-route implementation results and conduct design space exploration for the proposed architecture. As a result, the proposed systolic-vector architecture achieves 10.9x higher throughput and 30.17x higher energy efficiency than a comparable GPU on realistic ML workloads. The proposed heterogeneity-aware scheduling algorithm improves throughput and energy efficiency by 81% and 20%, respectively, compared to standard round-robin scheduling.
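To make the scheduling idea concrete, the sketch below illustrates heterogeneity-aware dispatch in Python. It is not the paper's actual algorithm: the cost model is a simple roofline-style estimate, and all names and engine parameters (`Engine`, `peak_flops`, `mem_bw`, the numeric values) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Engine:
    name: str          # e.g., "systolic" (GEMM-oriented) or "vector" (function-oriented)
    peak_flops: float  # sustained compute rate, FLOP/s (illustrative value)
    mem_bw: float      # effective memory bandwidth, bytes/s (illustrative value)
    busy_until: float  # time at which this engine's current queue drains

def estimate_runtime(flops, bytes_moved, e):
    # Roofline-style estimate: a layer is bound by whichever is slower,
    # computation or memory traffic.
    return max(flops / e.peak_flops, bytes_moved / e.mem_bw)

def dispatch(flops, bytes_moved, engines):
    # Heterogeneity-aware choice: assign the layer to the engine with the
    # earliest estimated finish time, accounting for work already queued.
    best = min(engines,
               key=lambda e: e.busy_until + estimate_runtime(flops, bytes_moved, e))
    best.busy_until += estimate_runtime(flops, bytes_moved, best)
    return best

engines = [
    Engine("systolic", peak_flops=8e12, mem_bw=1e11, busy_until=0.0),
    Engine("vector",   peak_flops=5e11, mem_bw=4e11, busy_until=0.0),
]
print(dispatch(2e9, 1e7, engines).name)  # compute-bound GEMM layer  -> "systolic"
print(dispatch(1e7, 1e8, engines).name)  # memory-bound elementwise  -> "vector"
```

A round-robin baseline would alternate engines regardless of layer shape; estimate-based dispatch is what routes compute-bound GEMMs to the systolic arrays and bandwidth-bound elementwise operations to the vector processors, which is the intuition behind the throughput gains reported above.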