Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked language models successfully learn to emulate semantic relations between expressions. However, when denotations are changed to be context-dependent with the language otherwise unmodified, this ability degrades. Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well. We show this failure relates to the context-dependent nature of natural language form-meaning mappings.