Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). We theoretically predict a width-dependent transition between depth-efficiency and depth-inefficiency in self-attention. We conduct systematic empirical ablations on networks of depths 6 to 48 that clearly reveal the theoretically predicted behaviors, and provide explicit quantitative suggestions regarding the optimal depth-to-width allocation for a given self-attention network size. The race towards language models beyond the 1-Trillion-parameter scale makes informed guidelines for increasing self-attention depth and width in tandem an essential ingredient. Our guidelines elucidate the depth-to-width trade-off in self-attention networks of sizes up to the scale of GPT-3 (which we project to be too deep for its size), and beyond, marking an unprecedented width of 30K as optimal for a 1-Trillion-parameter network.
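To make the depth-to-width allocation concrete, the following is a minimal illustrative sketch (not taken from the paper) of how a fixed parameter budget ties depth and width together. It assumes the common per-layer estimate of roughly 12·d² parameters for a standard transformer block (4·d² for the attention projections plus 8·d² for a feed-forward block with a 4x hidden expansion), ignores embeddings, and uses hypothetical helper names:

```python
# Illustrative sketch under the assumption of ~12 * d^2 parameters per
# transformer layer (attention projections + 4x feed-forward); embedding
# parameters are ignored. This only shows how a total parameter budget
# constrains the depth/width pair, not the paper's derivation itself.

def params_per_layer(width: int) -> int:
    """Approximate parameter count of one self-attention block of width d."""
    return 12 * width ** 2


def depth_for_budget(total_params: float, width: int) -> float:
    """Depth implied by a total parameter budget at a fixed width."""
    return total_params / params_per_layer(width)


if __name__ == "__main__":
    # At the abstract's suggested width of 30K for a 1-Trillion-parameter
    # network, this rough estimate implies a depth on the order of ~90 layers.
    print(depth_for_budget(1e12, 30_000))   # ~92.6

    # GPT-3 for comparison: ~175B parameters at width 12,288 and 96 layers;
    # the abstract projects this configuration to be too deep for its size.
    print(depth_for_budget(175e9, 12_288))  # ~96.6
```

The sketch only inverts a rough parameter-count formula; the paper's actual contribution is determining which of the feasible (depth, width) pairs along such a budget curve is optimal.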