Transformer-based neural networks have achieved state-of-the-art task performance in several machine learning domains, including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that attempts to meet the latency target while maintaining baseline accuracy. We evaluate PLANER on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy.
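To make the notion of a sparsely-activated layer concrete, the sketch below shows a minimal MoE feed-forward layer with top-1 routing in PyTorch, the kind of layer that can replace a dense Transformer FFN so that each token activates only one small expert. This is an illustrative assumption for exposition only, not PLANER's actual implementation; the class name `Top1MoEFFN` and all dimensions are hypothetical.

```python
# Minimal sketch (assumed, not PLANER's code): a sparsely-activated MoE
# feed-forward layer with top-1 routing, usable in place of a dense FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoEFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); route each token to its single best expert
        gate_probs = F.softmax(self.router(x), dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # only the routed tokens pass through this expert (sparse activation)
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: experts narrower than the original dense FFN reduce per-token FLOPs,
# which is the lever a latency-targeted search could tune.
layer = Top1MoEFFN(d_model=512, d_ff=1024, num_experts=4)
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # torch.Size([8, 512])
```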