Learning to predict masked tokens in a sequence has been shown to be a powerful pretraining objective for large-scale language models. After training, such masked language models can provide distributions over tokens conditioned on bidirectional context. In this short draft, we show that such bidirectional conditionals often exhibit considerable inconsistencies, i.e., when considered together, they cannot be derived from a single coherent joint distribution. We empirically quantify these inconsistencies in the simple setting of bigrams for two common styles of masked language models: T5-style and BERT-style. For example, we show that T5 models often contradict their own preferences between two similar bigrams. Such inconsistencies may represent a theoretical pitfall for research on sampling sequences from the bidirectional conditionals learned by BERT-style MLMs. This phenomenon also means that T5-style MLMs capable of infilling will produce discrepant results depending on how much of the context is masked, which raises a particular trustworthiness concern.
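To make the notion of inconsistency concrete, the following is a minimal sketch (not the paper's protocol) of one way to probe whether a BERT-style MLM's bidirectional conditionals over a bigram could come from any single joint distribution. It uses the Hugging Face transformers library with the bert-base-uncased checkpoint and a cross-ratio test over two candidate words per position; the model name, the example words, and the specific test are assumptions made here for illustration.

```python
# Sketch: probe cross-ratio consistency of BERT-style bidirectional conditionals
# on a two-token (bigram) template. Assumes transformers and torch are installed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_dist(text: str) -> torch.Tensor:
    """Return the MLM's distribution over the vocabulary at the [MASK] position."""
    inputs = tokenizer(text, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    return torch.softmax(logits, dim=-1)

def tok_id(word: str) -> int:
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    assert len(ids) == 1, f"{word!r} is not a single token"
    return ids[0]

# Hypothetical bigram slots: position 1 takes a / a_alt, position 2 takes b / b_alt.
a, a_alt, b, b_alt = "heavy", "light", "rain", "snow"

# Conditionals of position 1 given position 2, and vice versa.
p1_given_b     = masked_dist(f"{tokenizer.mask_token} {b}")
p1_given_b_alt = masked_dist(f"{tokenizer.mask_token} {b_alt}")
p2_given_a     = masked_dist(f"{a} {tokenizer.mask_token}")
p2_given_a_alt = masked_dist(f"{a_alt} {tokenizer.mask_token}")

# If all four conditionals came from one joint p(x1, x2), both cross-ratios below
# would equal p(a,b) p(a',b') / (p(a',b) p(a,b')) and would therefore coincide.
ratio_from_p1 = (p1_given_b[tok_id(a)] * p1_given_b_alt[tok_id(a_alt)]) / (
    p1_given_b[tok_id(a_alt)] * p1_given_b_alt[tok_id(a)]
)
ratio_from_p2 = (p2_given_a[tok_id(b)] * p2_given_a_alt[tok_id(b_alt)]) / (
    p2_given_a[tok_id(b_alt)] * p2_given_a_alt[tok_id(b)]
)
print(f"cross-ratio via p(x1|x2): {ratio_from_p1.item():.4f}")
print(f"cross-ratio via p(x2|x1): {ratio_from_p2.item():.4f}")
gap = abs(torch.log(ratio_from_p1) - torch.log(ratio_from_p2)).item()
print(f"log-scale discrepancy (0 if consistent): {gap:.4f}")
```

A nonzero log-scale discrepancy certifies that the four reported conditionals are mutually incompatible, since any coherent joint distribution forces the two cross-ratios to agree.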