Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole-word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and to reach high average character n-gram overlap across all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a learning curve nearly identical to that of training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not appear to enhance its performance on such tasks.
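To make the probing setup concrete, below is a minimal sketch of one way to probe a frozen embedding layer for spelling: take RoBERTa's input embedding matrix, attach a small character-level decoder, and train it to reconstruct each token's string from its embedding vector alone, evaluating exact-match spelling and character n-gram overlap on held-out vocabulary items. The GRU decoder architecture, the "roberta-base" checkpoint, and the ASCII character inventory are illustrative assumptions, not the paper's exact probe.

```python
# Sketch of a spelling probe over a frozen embedding layer (assumed architecture,
# not the paper's exact setup).
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Frozen input embeddings: one vector per vocabulary item; the probe sees no characters.
embeddings = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)

# Hypothetical character inventory for the decoder targets: printable ASCII + specials.
chars = ["<pad>", "<bos>", "<eos>"] + [chr(c) for c in range(32, 127)]
char2id = {c: i for i, c in enumerate(chars)}

class SpellingProbe(nn.Module):
    """Maps a single token embedding to a character sequence with a GRU decoder."""
    def __init__(self, emb_dim, n_chars, hidden=256):
        super().__init__()
        self.init_h = nn.Linear(emb_dim, hidden)   # embedding -> initial decoder state
        self.char_emb = nn.Embedding(n_chars, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, token_vecs, char_inputs):
        h0 = torch.tanh(self.init_h(token_vecs)).unsqueeze(0)  # (1, B, hidden)
        x = self.char_emb(char_inputs)                         # (B, T, hidden)
        y, _ = self.gru(x, h0)
        return self.out(y)                                     # (B, T, n_chars)

probe = SpellingProbe(embeddings.size(1), len(chars))
# Training loop (omitted): teacher-force the gold spelling of each token and minimize
# cross-entropy; score spelling accuracy and character n-gram overlap on a held-out
# split of the vocabulary so the probe cannot simply memorize the test tokens.
```

The key design point is that only the embedding vectors are given as input; any spelling the probe recovers must therefore already be encoded in the embedding layer itself.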