The Transformer and its variants have proven to be efficient sequence learners in many different domains. Despite their staggering success, a critical issue has been the enormous number of trainable parameters (ranging from $10^7$ to $10^{11}$) along with the quadratic complexity of dot-product attention. In this work, we investigate the problem of approximating the two central components of the Transformer -- multi-head self-attention and point-wise feed-forward transformation -- with a reduced parameter space and computational complexity. We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations. Exploiting an analogy between Transformer stages and the evolution of a dynamical system of multiple interacting particles, we formulate a temporal evolution scheme, TransEvolve, that bypasses costly dot-product attention over multiple stacked layers. We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks. We observe that the degree of approximation (or, inversely, the degree of parameter reduction) affects performance differently depending on the task. While in the encoder-decoder regime TransEvolve delivers performance comparable to the original Transformer, in encoder-only tasks it consistently outperforms the Transformer as well as several of its subsequent variants.
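To make the central idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation; the class names `EvolveBlock` and `TransEvolveSketch`, the single-head interaction matrix, and the residual/LayerNorm layout are illustrative assumptions) of computing an attention-like interaction matrix once from the input and reusing it across all depth steps, rather than recomputing dot-product attention in every layer:

```python
import torch
import torch.nn as nn


class EvolveBlock(nn.Module):
    """One temporal-evolution step: mixes tokens with a precomputed
    interaction matrix instead of recomputing dot-product attention."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, interaction):
        # Token-mixing step using the fixed, input-derived interaction matrix.
        x = self.norm1(x + interaction @ x)
        # Point-wise feed-forward update.
        return self.norm2(x + self.ffn(x))


class TransEvolveSketch(nn.Module):
    """Hypothetical encoder sketch: attention-like weights are computed
    once from the input and reused across all stacked evolution steps."""

    def __init__(self, d_model=512, d_ff=2048, depth=6):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.blocks = nn.ModuleList(
            EvolveBlock(d_model, d_ff) for _ in range(depth)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1)
        interaction = torch.softmax(scores / x.size(-1) ** 0.5, dim=-1)
        for block in self.blocks:  # no per-layer dot-product attention
            x = block(x, interaction)
        return x
```

The sketch only illustrates the cost-saving pattern suggested by the abstract; the actual TransEvolve formulation derived from the ODE/dynamical-systems view (e.g., its multi-head structure and depth-dependent evolution operators) is richer than this single-head approximation.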