Recently, language representation techniques have achieved strong performance in text classification. However, most existing representation models are designed specifically for English, and they may fail on Chinese because of the large differences between the two languages. In fact, most existing methods for Chinese text classification process texts at only a single level. Yet Chinese is a hieroglyphic language: the radicals of Chinese characters are good semantic carriers, Pinyin codes carry the semantics of tones, Wubi codes reflect stroke-structure information, \textit{etc}. Unfortunately, previous research has not found an effective way to distill the useful parts of these four factors and to fuse them. In this work, we propose a novel model called Moto: Enhancing Embedding with \textbf{M}ultiple J\textbf{o}int Fac\textbf{to}rs. Specifically, we design an attention mechanism that distills the useful parts by fusing the four levels of information above more effectively. We conduct extensive experiments on four popular tasks. The empirical results show that Moto achieves state-of-the-art performance: 0.8316 ($F_1$-score, a 2.11\% improvement) on Chinese news titles, 96.38 (a 1.24\% improvement) on the Fudan Corpus, and 0.9633 (a 3.26\% improvement) on THUCNews.
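The attention-based fusion described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual architecture: the function name \texttt{attention\_fuse}, the shared embedding dimension, and the dot-product scoring against a learnable query vector are all assumptions for demonstration.

```python
# Hypothetical sketch of attention-based fusion over four embedding levels
# (character, radical, Pinyin, Wubi). Names and dimensions are illustrative
# assumptions; the paper's actual mechanism may differ.
import numpy as np

DIM = 8  # assumed shared embedding dimension for all four levels

def attention_fuse(levels: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Fuse a (4, DIM) stack of level embeddings into one (DIM,) vector.

    Scores are dot products between each level embedding and a (learnable)
    query vector, normalized with a softmax so the model can emphasize the
    most informative levels and suppress the rest.
    """
    scores = levels @ query                 # (4,) one score per level
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the four levels
    return weights @ levels                 # attention-weighted sum, (DIM,)

# Toy usage: four level embeddings for a single character
rng = np.random.default_rng(0)
levels = rng.standard_normal((4, DIM))     # char / radical / Pinyin / Wubi
query = rng.standard_normal(DIM)
fused = attention_fuse(levels, query)
assert fused.shape == (DIM,)
```

In a trained model the query vector (and typically the level embeddings themselves) would be learned end-to-end with the classifier, so the softmax weights act as the "distillation" step that selects which of the four factors to trust per token.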