Chinese text spam detection is challenging because spammers exploit both glyph and phonetic variations of Chinese characters. This paper proposes a novel framework that jointly models Chinese variational, semantic, and contextualized representations for the Chinese text spam detection task. In particular, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is designed on top of a Chinese character variation graph. VFGE learns both the graph embeddings of individual Chinese characters (local) and the latent variation families (global). Furthermore, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to integrate the graph and text information while capturing sequential information. Extensive experiments on both SMS and review datasets show that the proposed method outperforms a series of state-of-the-art models for Chinese spam detection.
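The abstract does not specify the form of the combination gate. As a rough illustration only, the sketch below shows one common way a gate can blend a graph embedding with a text embedding: an elementwise sigmoid gate decides, per dimension, how much of each source to keep. The function name `combination_gate`, its parameters, and the gating formula are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combination_gate(
    graph_vec: Sequence[float],
    text_vec: Sequence[float],
    w_graph: Sequence[float],
    w_text: Sequence[float],
    bias: Sequence[float],
) -> List[float]:
    """Blend a graph embedding and a text embedding dimension by dimension.

    Hypothetical formulation (not the paper's):
        gate_i  = sigmoid(w_graph_i * graph_i + w_text_i * text_i + bias_i)
        fused_i = gate_i * graph_i + (1 - gate_i) * text_i
    """
    lengths = {len(graph_vec), len(text_vec), len(w_graph), len(w_text), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for gv, tv, wg, wt, b in zip(graph_vec, text_vec, w_graph, w_text, bias):
        gate = sigmoid(wg * gv + wt * tv + b)
        fused.append(gate * gv + (1.0 - gate) * tv)
    return fused
```

With all weights and biases at zero the gate is 0.5 everywhere, so the result is the plain average of the two embeddings; a large positive bias pushes the output toward the graph embedding. In a trained model the weights would be learned, and a full matrix would typically replace the per-dimension weights used here.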