The predominant approach to language modeling is to process sequences from left to right, but this discards a source of information: the order in which the sequence was generated. One strategy to recover this information is to decode both the content and the ordering of tokens. Existing approaches supervise content and ordering by designing problem-specific loss functions and pre-training with a pre-selected ordering. Other recent works use iterative search to discover problem-specific orderings for training, but suffer from high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised, parallelizable learner that discovers high-quality generation orders purely from training data -- no domain knowledge required. The learner consists of an encoder network and a decoder language model that perform variational inference with autoregressive orders (represented as permutation matrices) as latent variables. The corresponding ELBO is not differentiable, so we develop a practical algorithm for end-to-end optimization using policy gradients. We implement the encoder as a Transformer with non-causal attention that outputs permutations in a single forward pass. These permutations then serve as target generation orders for training an insertion-based Transformer language model. Empirical results on language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
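As a minimal sketch of this formulation (using generic notation that is assumed here, not taken from the paper: $q_\phi(z \mid x)$ for the encoder's distribution over permutations $z$, $p_\theta(x \mid z)$ for the insertion-based decoder, and $p(z)$ for a prior over orders), the ELBO and a standard score-function (policy-gradient) estimator for the encoder parameters take the form

\[
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
  \;=\; \mathbb{E}_{q_\phi(z \mid x)}
  \bigl[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\bigr],
\]
\[
\nabla_\phi \mathcal{L}
  \;=\; \mathbb{E}_{q_\phi(z \mid x)}
  \bigl[\bigl(\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\bigr)\,
        \nabla_\phi \log q_\phi(z \mid x)\bigr].
\]

The gradient with respect to the decoder parameters $\theta$ passes directly through $\log p_\theta(x \mid z)$, while the expectation over discrete permutations is the non-differentiable part handled by the score-function term; in practice a baseline is typically subtracted from the bracketed reward to reduce variance, and the exact estimator used in the paper may differ from this generic sketch.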