Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best-performing upsampling and downsampling layers to create Hourglass, a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.
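To make the downsample/upsample hierarchy concrete, the sketch below shows one possible hourglass-shaped block in PyTorch: full-resolution layers, a shortened middle stack, and an upsampling step with a residual connection back to the pre-shortening activations. The module name HourglassBlock and all hyperparameters are hypothetical; average pooling and nearest-neighbor repetition are only two of the downsampling/upsampling variants one might study, not necessarily the layers the paper selects, and causal shifting/masking needed for autoregressive modeling is omitted for brevity.

```python
# Minimal sketch of an hourglass-style hierarchy around Transformer layers.
# Assumptions: average-pooling downsampling, repeat-based upsampling with a
# residual connection, and bidirectional encoder layers (no causal masking).
import torch
import torch.nn as nn


class HourglassBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, pre_layers=2, mid_layers=4,
                 post_layers=2, shorten_factor=4):
        super().__init__()
        self.shorten_factor = shorten_factor
        layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.pre = nn.ModuleList(layer() for _ in range(pre_layers))    # full resolution
        self.mid = nn.ModuleList(layer() for _ in range(mid_layers))    # shortened resolution
        self.post = nn.ModuleList(layer() for _ in range(post_layers))  # full resolution

    def forward(self, x):
        # x: (batch, seq_len, d_model); seq_len must be divisible by shorten_factor here.
        for blk in self.pre:
            x = blk(x)
        residual = x
        b, t, d = x.shape
        k = self.shorten_factor
        # Downsample: average-pool each group of k consecutive token activations.
        shortened = x.reshape(b, t // k, k, d).mean(dim=2)
        for blk in self.mid:
            shortened = blk(shortened)
        # Upsample: repeat each shortened vector k times, then add the residual
        # so token-level detail from before shortening is preserved.
        x = shortened.repeat_interleave(k, dim=1) + residual
        for blk in self.post:
            x = blk(x)
        return x


if __name__ == "__main__":
    model = HourglassBlock()
    tokens = torch.randn(2, 64, 256)   # (batch, seq_len, d_model)
    out = model(tokens)
    print(out.shape)                   # torch.Size([2, 64, 256])
```

The attraction of this shape is that the expensive middle layers attend over a sequence shortened by the factor k, so their attention cost drops roughly quadratically in k, while the thin full-resolution layers at the ends keep token-level inputs and outputs.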