Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization of language models, as it provides multiple benefits. However, this process is based solely on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. At the other extreme, pure character-level models, though robust to misspellings, often produce unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models such as BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement for the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We further integrate it with BERT through pre-training while keeping the BERT transformer parameters fixed, which makes the method practical. Finally, we show that incorporating our module into mBERT significantly improves performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
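To make the drop-in idea concrete, below is a minimal PyTorch sketch of a character-based subword embedding module: it maps the character sequence of each subword to a single vector of the size a frozen BERT transformer expects. The class name `Char2SubwordEmbedding`, the Transformer-encoder with mean pooling, and all dimensions are illustrative assumptions for exposition, not the paper's exact char2subword architecture.

```python
# A minimal sketch of a character-to-subword embedding module (hypothetical names
# and sizes; not the paper's exact char2subword architecture).
import torch
import torch.nn as nn


class Char2SubwordEmbedding(nn.Module):
    """Maps the character sequence of each subword to one embedding vector,
    serving as a drop-in replacement for a subword embedding lookup table."""

    def __init__(self, char_vocab_size: int, char_dim: int = 64,
                 hidden_dim: int = 768, num_layers: int = 2, max_chars: int = 20):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_chars, char_dim)
        layer = nn.TransformerEncoderLayer(d_model=char_dim, nhead=4,
                                           dim_feedforward=4 * char_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Project the pooled character representation to the embedding size the
        # frozen BERT transformer expects (e.g., 768 for BERT-base).
        self.proj = nn.Linear(char_dim, hidden_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_subwords, max_chars), 0 = padding; each subword is
        # assumed to contain at least one non-padding character.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embed(char_ids) + self.pos_embed(positions).unsqueeze(0)
        pad_mask = char_ids.eq(0)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padded character positions, then project.
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        pooled = x.masked_fill(pad_mask.unsqueeze(-1), 0.0).sum(dim=1) / lengths
        return self.proj(pooled)  # (num_subwords, hidden_dim)
```

In such a setup, the vectors produced by this module would be fed to the transformer layers in place of lookups from the subword embedding table, while those transformer parameters stay frozen during pre-training.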