Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore a unique feature of the Chinese writing system: additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary from the encoded text with sub-word tokenization. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving computational efficiency. 2) Pronunciation-based SubChar tokenizers encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to all homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code at https://github.com/thunlp/SubCharTokenization to facilitate future work.
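To make the two-step pipeline concrete, the following is a minimal, hypothetical Python sketch of the pronunciation-based variant. It assumes the third-party `pypinyin` package for character-to-pinyin transliteration; the function name `encode_pronunciation` is illustrative only and is not taken from the authors' implementation (see the repository above for that).

```python
# Minimal sketch of pronunciation-based SubChar encoding (step 1).
# Illustrative only; not the authors' implementation. Assumes the
# third-party `pypinyin` package is installed (pip install pypinyin).
from pypinyin import Style, lazy_pinyin


def encode_pronunciation(text: str) -> str:
    """Transliterate each Chinese character into tone-numbered pinyin.

    Separating syllables with spaces keeps character boundaries
    recoverable; a subword tokenizer (e.g., BPE or unigram LM) is then
    trained on the encoded corpus to build the vocabulary (step 2).
    """
    return " ".join(lazy_pinyin(text, style=Style.TONE3))


# Homophones collapse to the same encoded sequence, so any subword
# tokenizer trained on the encoded text tokenizes them identically --
# the robustness property described in the abstract.
print(encode_pronunciation("他在看书"))  # ta1 zai4 kan4 shu1
print(encode_pronunciation("她在看书"))  # ta1 zai4 kan4 shu1 (identical)
```

Because the homophone typo 她/他 yields the same transliteration, the downstream model sees identical input either way, which is the source of the robustness claim; the glyph-based variant works analogously but decomposes characters into stroke or radical sequences instead of pinyin.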