Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches, instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
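To make the pretraining objective concrete, the sketch below illustrates masked patch reconstruction under simplifying assumptions: `render_text` is a stand-in for a real text rasteriser and `ToyEncoderDecoder` is a toy stand-in for PIXEL's ViT-MAE-style encoder-decoder; neither reflects the released implementation.

```python
# Minimal sketch of pixel-based pretraining: render text to an image, split it
# into patches, mask a subset, and reconstruct the masked pixels with MSE.
# `render_text` and `ToyEncoderDecoder` are illustrative placeholders, not PIXEL's code.
import torch
import torch.nn as nn

PATCH = 16               # patch height/width in pixels
IMG_H, IMG_W = 16, 528   # single-row rendering, a simplifying assumption

def render_text(text: str) -> torch.Tensor:
    """Hypothetical renderer mapping a string to a (1, IMG_H, IMG_W) grayscale image.
    Faked here with pseudo-random pixels; a real renderer would rasterise the text."""
    torch.manual_seed(abs(hash(text)) % (2**31))
    return torch.rand(1, IMG_H, IMG_W)

def to_patches(img: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into a sequence of flattened PATCH x PATCH patches."""
    c, _, _ = img.shape
    patches = img.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    return patches.reshape(c, -1, PATCH * PATCH).transpose(0, 1).reshape(-1, c * PATCH * PATCH)

class ToyEncoderDecoder(nn.Module):
    """Toy stand-in for an encoder-decoder that reconstructs pixel patches."""
    def __init__(self, patch_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.GELU(), nn.Linear(hidden, patch_dim)
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.net(patches)

# --- one pretraining step ---
patches = to_patches(render_text("an example sentence"))  # (num_patches, patch_dim)

# Randomly mask a fraction of patches; the loss is computed only on masked
# positions, so the model reconstructs pixels instead of predicting tokens.
mask = torch.rand(patches.size(0)) < 0.25
corrupted = patches.clone()
corrupted[mask] = 0.0

model = ToyEncoderDecoder(patch_dim=patches.size(1))
recon = model(corrupted)
loss = ((recon[mask] - patches[mask]) ** 2).mean()  # MSE over masked patches only
loss.backward()
print(f"masked patches: {int(mask.sum())}, reconstruction loss: {loss.item():.4f}")
```

Because the output is a patch of pixels rather than a distribution over a fixed vocabulary, the model has no softmax over tokens in the output layer, which is how the vocabulary bottleneck described above is avoided.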