Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore a unique feature of the Chinese writing system: additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary from the encoded text with subword segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) they can tokenize inputs into much shorter sequences, thus improving computational efficiency; 2) pronunciation-based SubChar tokenizers encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
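To make the encode-then-segment pipeline concrete, below is a minimal sketch of the pronunciation-based variant, not the released implementation: the pypinyin library for transliteration, the HuggingFace tokenizers BPE trainer for the subword step, and the "#" boundary separator are all illustrative assumptions not specified in this abstract.

```python
# Minimal sketch of pronunciation-based SubChar tokenization (assumed
# libraries: pypinyin for transliteration, HuggingFace tokenizers for BPE).
from pypinyin import lazy_pinyin, Style
from tokenizers import Tokenizer, models, trainers

def encode_pronunciation(text: str) -> str:
    """Transliterate each Chinese character into tone-numbered pinyin."""
    # Style.TONE3 appends the tone digit, e.g. "中" -> "zhong1".
    syllables = lazy_pinyin(text, style=Style.TONE3)
    # "#" marks character boundaries in the encoded stream (an illustrative choice).
    return "#".join(syllables) + "#"

# Homophones collapse to the same encoding: "他" and "她" both map to "ta1",
# which is what makes the tokenization robust to homophone typos.
assert encode_pronunciation("他") == encode_pronunciation("她")

# Learn a subword vocabulary on the transliterated corpus with BPE.
corpus = ["中文预训练语言模型", "分词方法"]
encoded_corpus = [encode_pronunciation(line) for line in corpus]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(encoded_corpus, trainer)

print(tokenizer.encode(encode_pronunciation("语言模型")).tokens)
```

Because frequent multi-syllable spans can merge into single BPE tokens, the encoded input often tokenizes into fewer units than one-token-per-character schemes, which is consistent with the efficiency gain claimed above.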