Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments of tokens. Nevertheless, natural units of meaning, such as words or phrases, display varying sizes. To address this mismatch, we equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. We compare several methods to infer boundaries, including end-to-end learning through stochastic re-parameterisation, supervised learning (based on segmentations from subword tokenizers or spikes in conditional entropy), as well as linguistically motivated boundaries. We perform character-level evaluation on texts from multiple datasets and morphologically diverse languages. The results demonstrate that dynamic pooling, which jointly segments and models language, is often both faster and more accurate than vanilla Transformers and fixed-length pooling within the same computational budget.
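To make the contrast in the abstract concrete, the sketch below compares fixed-length pooling with pooling over supplied segment boundaries (as a boundary predictor would produce). It is a minimal, assumption-laden toy in PyTorch, not the paper's implementation: the function names, the choice of mean-pooling, and the boundary encoding (1 marks a segment-initial token) are illustrative.

```python
# Minimal sketch: reduce sequence length by pooling token representations,
# either over fixed-length blocks or over variable-length segments given by
# a binary boundary indicator (1 = first token of a new segment).
import torch

def fixed_length_pool(h: torch.Tensor, k: int) -> torch.Tensor:
    """Mean-pool consecutive blocks of k tokens. h: (seq_len, d)."""
    seq_len, d = h.shape
    pad = (-seq_len) % k                       # zero-pad so seq_len divides by k
    h = torch.cat([h, h.new_zeros(pad, d)])
    return h.view(-1, k, d).mean(dim=1)        # (ceil(seq_len / k), d)

def boundary_pool(h: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Mean-pool variable-length segments defined by boundary indicators."""
    boundaries = boundaries.long().clone()
    boundaries[0] = 1                          # the first token always opens a segment
    seg_id = torch.cumsum(boundaries, dim=0) - 1   # segment index per token
    num_seg = int(seg_id[-1]) + 1
    sums = h.new_zeros(num_seg, h.shape[1]).index_add_(0, seg_id, h)
    counts = torch.bincount(seg_id, minlength=num_seg).clamp(min=1).unsqueeze(1)
    return sums / counts                       # (num_segments, d)

# Toy usage: 10 character embeddings of width 8, boundaries at word-initial characters.
h = torch.randn(10, 8)
b = torch.tensor([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
print(fixed_length_pool(h, k=2).shape)   # torch.Size([5, 8])
print(boundary_pool(h, b).shape)         # torch.Size([4, 8])
```

In the dynamic setting described above, the boundary vector would be predicted autoregressively by the model (or supervised by a subword tokenizer, entropy spikes, or linguistic segmentation) rather than fixed in advance.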