Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive with the state-of-the-art on long text summarization.