Chinese characters carry a wealth of morphological and semantic information; therefore, the semantic enhancement of the morphology of Chinese characters has drawn significant attention. Previous methods extract information directly from a whole Chinese character image and usually cannot capture both global and local information simultaneously. In this paper, we develop a stroke-based autoencoder (SAE) to model the sophisticated morphology of Chinese characters in a self-supervised manner. We first represent a Chinese character as a series of stroke images in its canonical writing order, and then train our SAE model to reconstruct this stroke image sequence. The pre-trained SAE model can predict the stroke image series for unseen characters, as long as their strokes or radicals appear in the training set. We have designed two contrasting SAE architectures on different forms of stroke images. One is fine-tuned on an existing stroke-based method for zero-shot recognition of handwritten Chinese characters, and the other is applied to enrich Chinese word embeddings with their morphological features. The experimental results validate that, after pre-training, our SAE architecture outperforms other existing methods in zero-shot recognition and enhances the representation of Chinese characters with their abundant morphological and semantic information.
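The stroke image sequence described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the SAE model itself: it only shows how a character might be turned into one image per stroke in writing order, and how a reconstruction error over that sequence could be measured. The function names, the 8x8 grid, the point-list stroke format, and the use of pixel-level mean squared error are all assumptions made for illustration; the abstract does not specify the image format or the training loss.

```python
from typing import List, Tuple

Point = Tuple[int, int]      # (row, col) on the pixel grid (assumed format)
Image = List[List[float]]    # size x size grid of values in [0, 1]


def stroke_image_sequence(strokes: List[List[Point]], size: int = 8) -> List[Image]:
    """Render one binary image per stroke, keeping the given writing order."""
    sequence = []
    for stroke in strokes:
        image = [[0.0] * size for _ in range(size)]
        for row, col in stroke:
            image[row][col] = 1.0
        sequence.append(image)
    return sequence


def reconstruction_error(target: List[Image], predicted: List[Image]) -> float:
    """Mean squared pixel error between two stroke image sequences.

    Hypothetical objective: the abstract only says the model is trained
    to reconstruct the sequence, not which loss is used.
    """
    if len(target) != len(predicted):
        raise ValueError("sequences must contain the same number of stroke images")
    total, count = 0.0, 0
    for t_img, p_img in zip(target, predicted):
        for t_row, p_row in zip(t_img, p_img):
            for t, p in zip(t_row, p_row):
                total += (t - p) ** 2
                count += 1
    return total / count


# Toy character "十" with two strokes: horizontal first, then vertical.
SHI = [
    [(4, c) for c in range(1, 7)],   # horizontal stroke
    [(r, 4) for r in range(1, 7)],   # vertical stroke
]

if __name__ == "__main__":
    target = stroke_image_sequence(SHI)
    blank = [[[0.0] * 8 for _ in range(8)] for _ in target]
    print(len(target), reconstruction_error(target, blank))
```

In a real pipeline, the `predicted` sequence would come from the decoder of a trained autoencoder rather than from a blank placeholder, and the stroke images would be rendered from font or handwriting data instead of hand-written point lists.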