Transformers have become a predominant machine learning workload: they are not only the de-facto standard for natural language processing tasks, but they are also increasingly deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems, such as machine translation and web search, that come with strict end-to-end inference latency requirements. Unfortunately, while the majority of transformer computation comes from matrix multiplications, transformers also include several nonlinear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the Tensor Streaming Processor. By carefully fusing all the nonlinear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 $\mu$s for a batch-1 inference through BERT-base, which is 6$\times$ faster than the current state-of-the-art.
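To make the fusion idea concrete, the following is a minimal JAX sketch of the general technique of fusing a nonlinear component into a matrix multiplication so the nonlinearity never runs as a separate, memory-bound pass. This is not the paper's TSP implementation; the names `fused_dense_gelu`, `x`, and `w` are illustrative, and the shapes assume the BERT-base hidden size of 768 with its 4$\times$ feed-forward expansion.

```python
import jax
import jax.numpy as jnp

@jax.jit
def fused_dense_gelu(x, w):
    # Under jit, the XLA compiler may fuse the GELU epilogue into the
    # matmul, so the activation is applied as the product is produced
    # rather than in a separate pass over the intermediate result.
    return jax.nn.gelu(x @ w)

x = jnp.ones((1, 768), dtype=jnp.float32)     # batch-1 activations, BERT-base hidden size
w = jnp.ones((768, 3072), dtype=jnp.float32)  # feed-forward expansion weights
y = fused_dense_gelu(x, w)
print(y.shape)  # (1, 3072)
```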