Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by only a subset of the parameters. The degree of sparsity thereby decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
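To make the unifying idea concrete, the following is a minimal sketch of top-1 expert routing in plain NumPy, not any specific paper's implementation. All sizes and names (`w_router`, `experts`, `moe_layer`) are illustrative assumptions; the point is that each token touches only one expert's parameters, so per-token compute stays fixed while the total parameter count grows with the number of experts.

```python
# Minimal sketch of top-1 expert routing (illustrative assumption, not a
# faithful reproduction of any published architecture). Each token is sent
# to the single expert whose router score is highest, so only that expert's
# parameters are applied to the token.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, n_tokens = 16, 4, 8

# Router: one linear projection producing a score per expert.
w_router = rng.normal(size=(d_model, n_experts))

# Experts: here each expert is a single dense layer (hypothetical sizes).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (n_tokens, d_model) -> (n_tokens, d_model) via top-1 routing."""
    logits = x @ w_router                        # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax over experts
    chosen = probs.argmax(axis=-1)               # top-1 expert per token
    out = np.zeros_like(x)
    for e in range(n_experts):
        mask = chosen == e
        if mask.any():
            # Scale outputs by the router probability; in an autodiff
            # framework this is what lets gradients reach the router.
            out[mask] = (x[mask] @ experts[e]) * probs[mask, e:e + 1]
    return out

tokens = rng.normal(size=(n_tokens, d_model))
print(moe_layer(tokens).shape)  # (8, 16)
```

Adding experts enlarges `experts` (and the model's capacity) without changing the work done per token, which is the decoupling of parameters from compute that the abstract describes.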