Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
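The two mechanisms named in the abstract are conditional computation via a learned router (each token is processed by only a few experts) and a batch-wide priority over tokens that enables adaptive per-image compute. The sketch below illustrates the general idea of sparsely-gated top-k routing in JAX; all names, shapes, the linear experts, and the priority rule are illustrative assumptions rather than the paper's actual implementation (V-MoE places MLP experts inside a Vision Transformer and uses a more elaborate capacity and dispatch scheme), and the experts are run densely here purely for readability.

```python
import jax
import jax.numpy as jnp


def top_k_gating(router_logits, k):
    """Pick the k highest-scoring experts per token and renormalise their gates.

    router_logits: [num_tokens, num_experts] scores from a learned router.
    Returns gate weights and expert indices, both of shape [num_tokens, k].
    """
    gates = jax.nn.softmax(router_logits, axis=-1)
    top_gates, top_experts = jax.lax.top_k(gates, k)
    top_gates = top_gates / jnp.sum(top_gates, axis=-1, keepdims=True)
    return top_gates, top_experts


def batch_priority_mask(gates, capacity):
    """Keep only the `capacity` tokens with the highest top-1 gate weight
    across the whole batch; the rest bypass the experts entirely.
    This is an illustrative priority rule, not the paper's exact algorithm.
    """
    priority = jnp.max(gates, axis=-1)                    # [num_tokens]
    threshold = jax.lax.top_k(priority, capacity)[0][-1]  # capacity-th best score
    return priority >= threshold                          # [num_tokens] bool


def moe_layer(tokens, router_w, expert_ws, k=2, capacity=None):
    """Dense emulation of a sparse MoE layer: every expert is applied to every
    token for readability, and the combine weights zero out non-selected pairs.
    A truly sparse implementation dispatches tokens so unselected experts do
    no work, which is where the inference-time savings come from.

    tokens:    [num_tokens, d_model]
    router_w:  [d_model, num_experts]
    expert_ws: [num_experts, d_model, d_model]  (one linear expert per slice)
    """
    num_experts = router_w.shape[-1]
    gates, experts = top_k_gating(tokens @ router_w, k)    # both [num_tokens, k]
    # Combine weights: [num_tokens, num_experts], nonzero only for chosen experts.
    combine = jnp.sum(jax.nn.one_hot(experts, num_experts) * gates[..., None], axis=1)
    if capacity is not None:
        # Discard low-priority tokens batch-wide to meet a fixed compute budget.
        combine = combine * batch_priority_mask(gates, capacity)[:, None]
    expert_outputs = jnp.einsum('td,edh->teh', tokens, expert_ws)
    return jnp.einsum('te,teh->th', combine, expert_outputs)  # [num_tokens, d_model]
```

As a usage note under these assumptions: with `tokens` of shape `[196, 768]`, 8 experts, `k=2`, and `capacity=98`, only the 98 highest-priority tokens in the batch reach the experts, which is the kind of knob that lets performance and compute be traded off smoothly at test time.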