Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
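To make the abstract's description concrete, below is a minimal sketch of an order-2 Hyena-style operator: input projections, an implicitly parametrized long convolution applied via FFT, and element-wise data-controlled gating. This is not the reference implementation; the filter parametrization, hidden sizes, and names (`HyenaSketch`, `fft_long_conv`, `filter_mlp`) are illustrative assumptions, and details such as short depthwise convolutions on the projections and windowed/decayed filters are omitted.

```python
# Minimal sketch (assumptions labeled above) of a Hyena-style operator:
# interleaved implicit long convolutions and data-controlled gating.
import torch
import torch.nn as nn


def fft_long_conv(v, h):
    """Causal long convolution of v (B, L, D) with a filter h (L, D) via FFT."""
    L = v.shape[1]
    H = torch.fft.rfft(h, n=2 * L, dim=0)                     # (Lf, D)
    V = torch.fft.rfft(v, n=2 * L, dim=1)                     # (B, Lf, D)
    y = torch.fft.irfft(V * H.unsqueeze(0), n=2 * L, dim=1)   # (B, 2L, D)
    return y[:, :L]                                           # keep causal part


class HyenaSketch(nn.Module):
    def __init__(self, d_model, seq_len, order=2):
        super().__init__()
        self.order = order
        # One value stream plus `order` gating streams.
        self.in_proj = nn.Linear(d_model, d_model * (order + 1))
        self.out_proj = nn.Linear(d_model, d_model)
        # Implicit filters: a small MLP maps positional features to one
        # long filter per recurrence step (here, normalized positions only).
        self.register_buffer("t", torch.linspace(0, 1, seq_len).unsqueeze(-1))
        self.filter_mlp = nn.Sequential(
            nn.Linear(1, 64), nn.GELU(), nn.Linear(64, d_model * order)
        )

    def forward(self, u):                                      # u: (B, L, D)
        B, L, D = u.shape
        streams = self.in_proj(u).chunk(self.order + 1, dim=-1)
        v, gates = streams[0], streams[1:]
        filters = self.filter_mlp(self.t).reshape(L, self.order, D)
        z = v
        for n in range(self.order):
            # Long convolution followed by element-wise (data-controlled) gating.
            z = gates[n] * fft_long_conv(z, filters[:, n])
        return self.out_proj(z)


# Usage: same (batch, length, width) interface as a self-attention block.
x = torch.randn(2, 1024, 128)
y = HyenaSketch(d_model=128, seq_len=1024)(x)
print(y.shape)  # torch.Size([2, 1024, 128])
```

The FFT-based convolution is what keeps the operator subquadratic: each step costs O(L log L) in sequence length rather than the O(L^2) of dense attention.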