Attention is a commonly used mechanism in sequence processing, but its O(n^2) complexity prevents its application to long sequences. The recently introduced neural Shuffle-Exchange network offers a computation-efficient alternative, enabling the modelling of long-range dependencies in O(n log n) time. The model, however, is quite complex, involving a sophisticated gating mechanism derived from the Gated Recurrent Unit. In this paper, we present a simple and lightweight variant of the Shuffle-Exchange network, based on a residual network employing GELU and Layer Normalization. The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy. It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription while remaining parameter-efficient. We show how to combine the improved Shuffle-Exchange network with convolutional layers, establishing it as a useful building block in long sequence processing applications.
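To make the described architecture concrete, the following is a minimal sketch of a residual Shuffle-Exchange block in PyTorch: a switch unit that applies LayerNorm, a GELU feed-forward transformation, and a residual connection to adjacent element pairs, alternated with a perfect-shuffle permutation so that log2(n) layers connect all positions in O(n log n) total work. All names (ResidualSwitchUnit, ShuffleExchangeBlock, hidden_mult) and the exact layer composition are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class ResidualSwitchUnit(nn.Module):
    """Illustrative residual switch unit: LayerNorm -> Linear -> GELU -> Linear
    applied to each pair of adjacent sequence elements, with a residual connection.
    (Hypothetical simplification of the unit described in the paper.)"""

    def __init__(self, dim, hidden_mult=2):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.ff = nn.Sequential(
            nn.Linear(2 * dim, hidden_mult * 2 * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * 2 * dim, 2 * dim),
        )

    def forward(self, x):
        # x: (batch, n, dim) with n a power of two
        b, n, d = x.shape
        pairs = x.reshape(b, n // 2, 2 * d)        # join adjacent elements into pairs
        out = pairs + self.ff(self.norm(pairs))    # residual feed-forward on each pair
        return out.reshape(b, n, d)


def perfect_shuffle(x):
    """Riffle-shuffle permutation along the sequence axis: interleave the two halves."""
    b, n, d = x.shape
    return x.reshape(b, 2, n // 2, d).transpose(1, 2).reshape(b, n, d)


class ShuffleExchangeBlock(nn.Module):
    """Alternates switch units with perfect shuffles; log2(n) layers suffice to
    connect every pair of positions, giving O(n log n) total computation."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ResidualSwitchUnit(dim) for _ in range(num_layers))

    def forward(self, x):
        for switch in self.layers:
            x = perfect_shuffle(switch(x))
        return x


# Example usage on a toy batch: sequence length 16 = 2**4, so 4 layers span all positions.
x = torch.randn(4, 16, 32)
y = ShuffleExchangeBlock(dim=32, num_layers=4)(x)
```

In this sketch the quadratic all-pairs interaction of attention is replaced by a fixed butterfly-like connectivity pattern, which is what keeps the cost at O(n log n); such a block can be interleaved with convolutional layers as the abstract suggests.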