Deep learning, and hardware for it, have garnered immense academic and industry interest in the past five years, with many novel proposals. However, the state of the art remains NVIDIA's TensorCore-based systems, which provide top-of-the-line performance and coverage across a wide spectrum of deep learning applications. In this paper, we first identify four key problems any new DL solution must solve: 1) Data orchestration, 2) Data movement, 3) Work placement, and blending these to achieve 4) Coverage across different types of DL applications. With this as a guide, we propose Violet, a novel architecture with roots in multicore SIMD that balances the responsibilities for these four problems between the architecture, the microarchitecture, and the software stack. Compared to the NVIDIA A100 GPU, we find Violet achieves geomean 2.4X/10.6X and 2.1X/9.5X performance/efficiency for inference and training, respectively, across the MLPerf benchmark suite. We present a detailed operator-level analysis of the MLPerf benchmark suite, extracting the key behaviors that underpin the speedup and efficiency, with implications for architecture research beyond this paper. Overall, this paper motivates the importance of balance: the breakdown of responsibilities must be thought through carefully in order to compete with incumbent architecture designs.