The rapidly changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose the Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet and BERT, and use FAST to design accelerators capable of addressing these bottlenecks. FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7x on average across all benchmarks compared to TPU-v3. A FAST-generated accelerator optimized for serving a suite of workloads improves Perf/TDP by 2.4x on average compared to TPU-v3. Our return-on-investment analysis shows that FAST-generated accelerators can potentially be practical for moderate-sized datacenter deployments.