In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and perform in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on the input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers while using smaller model sizes, but also extends the feasible sequence length beyond the limitations of dense attention layers. This achievement represents a significant step towards more efficient and scalable deep learning models for sequence modeling. The code is available at https://github.com/Karami-m/orchid.
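To make the central idea concrete, the following is a minimal sketch (not the authors' implementation) of a data-dependent global convolution: a small conditioning network predicts a per-sequence kernel from the input, and the convolution is carried out in the Fourier domain so the cost scales as O(L log L) in the sequence length L. The module name, layer sizes, and the pointwise conditioning network are illustrative assumptions; in particular, this sketch does not reproduce the paper's shift-equivariant conditioning designs.

```python
# Minimal sketch of a data-dependent global (long) convolution via FFT.
# Assumptions: PyTorch, a toy pointwise conditioning network, illustrative names.
import torch
import torch.nn as nn


class DataDependentGlobalConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Conditioning network: maps the input sequence to a sequence-length kernel.
        # The paper uses dedicated conditioning networks that preserve shift
        # equivariance; this simple per-position MLP is only a placeholder.
        self.cond = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        L = x.shape[1]
        k = self.cond(x)                       # data-dependent kernel, (batch, L, dim)
        X = torch.fft.rfft(x, n=2 * L, dim=1)  # zero-pad to avoid circular wrap-around
        K = torch.fft.rfft(k, n=2 * L, dim=1)
        # Pointwise product in frequency domain = global convolution in time domain,
        # giving quasilinear O(L log L) complexity instead of attention's O(L^2).
        y = torch.fft.irfft(X * K, n=2 * L, dim=1)[:, :L]
        return y


# Usage: a (batch=2, length=1024, dim=64) input convolved with a kernel derived from itself.
out = DataDependentGlobalConv(64)(torch.randn(2, 1024, 64))
print(out.shape)  # torch.Size([2, 1024, 64])
```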