GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph

Previous work indicates that the glyphs of Chinese characters contain rich semantic information and have the potential to enhance the representation of Chinese characters. The typical way to use glyph features is to incorporate them into the character embedding space. Inspired by these methods, we propose a Chinese pre-trained representation model named GlyphCRM, which abandons the ID-based character embedding method and relies solely on sequential character images. We render each character into a binary grayscale image and design two-channel position feature maps for it. Formally, we first design a two-layer residual convolutional neural network, named HanGlyph, to generate the initial glyph representation of Chinese characters, and then stack multiple bidirectional encoder Transformer blocks on top of it to capture context-sensitive information. Meanwhile, we feed the glyph features extracted from each layer of the HanGlyph module into the underlying Transformer blocks through skip connections to fully exploit the glyph features of Chinese characters. As the HanGlyph module can obtain a sufficient glyph representation of any Chinese character, the long-standing out-of-vocabulary problem could be effectively solved. Extensive experimental results indicate that GlyphCRM substantially outperforms the previous BERT-based state-of-the-art model on 9 fine-tuning tasks, and that it has strong transferability and generalization on specialized fields and low-resource tasks. We hope this work will spark further research beyond the well-established representations of Chinese texts.
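
To make the input construction and the residual glyph encoder more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. Several details are assumptions that the abstract does not specify: the two position channels are taken to be normalized row and column coordinate maps, the glyph is a hand-written 5x5 bitmap standing in for a rendered character, a hypothetical stem convolution mixes the three input channels into one feature map, and the kernels are placeholder values rather than learned weights. The names `position_maps`, `build_input`, `conv3x3`, and `han_glyph_sketch` are illustrative only.

```python
from typing import List

Grid = List[List[float]]  # one image channel, indexed as grid[row][col]

# Hypothetical 5x5 binary bitmap standing in for a rendered character.
GLYPH: Grid = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def position_maps(height: int, width: int) -> List[Grid]:
    """Assumed position features: row and column indices scaled to [0, 1].

    Requires height > 1 and width > 1.
    """
    rows = [[r / (height - 1)] * width for r in range(height)]
    cols = [[c / (width - 1) for c in range(width)] for _ in range(height)]
    return [rows, cols]


def build_input(glyph: Grid) -> List[Grid]:
    """Stack the glyph channel with the two position channels."""
    return [glyph] + position_maps(len(glyph), len(glyph[0]))


def conv3x3(channels: List[Grid], kernels: List[Grid]) -> Grid:
    """Zero-padded 3x3 convolution summing over input channels (one output map)."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * w for _ in range(h)]
    for ch, k in zip(channels, kernels):
        for r in range(h):
            for c in range(w):
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            out[r][c] += ch[rr][cc] * k[dr + 1][dc + 1]
    return out


def relu(grid: Grid) -> Grid:
    return [[max(0.0, v) for v in row] for row in grid]


def han_glyph_sketch(
    inputs: List[Grid], stem_kernels: List[Grid], layer_kernels: List[Grid]
) -> List[Grid]:
    """Apply a stem convolution, then one residual layer per kernel.

    Returns the feature map produced by every residual layer, so that each
    one could be passed on to a later module through a skip connection.
    """
    h = conv3x3(inputs, stem_kernels)
    features = []
    for kernel in layer_kernels:
        y = relu(conv3x3([h], [kernel]))
        # Residual connection: add the layer output back onto its input.
        h = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(h, y)]
        features.append(h)
    return features
```

Because the encoder consumes pixels rather than vocabulary IDs, any character that can be rendered yields an input of the same shape, which is the property the abstract relies on for the out-of-vocabulary problem. In this sketch, calling `han_glyph_sketch(build_input(GLYPH), stem_kernels, layer_kernels)` with two entries in `layer_kernels` returns two feature maps, one per residual layer.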

