Recently published graph neural networks (GNNs) show promising performance on social event detection tasks. However, most studies focus on monolingual data in languages with abundant training samples, leaving the more common multilingual settings and lesser-spoken languages relatively unexplored. We therefore present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams. The first step is to make the GNN work with multilingual data. For this, we outline a graph construction strategy that aligns messages in different languages at both the node and semantic levels. At the node level, relationships between messages are established by merging entities that are the same but are referred to in different languages. At the semantic level, non-English message representations are converted into the English semantic space via the cross-lingual word embeddings. The resulting message graph is then uniformly encoded by a GNN model. For the special case where events must be detected in a lesser-spoken language, a novel cross-lingual knowledge distillation framework, called CLKD, exploits prior knowledge learned from similar threads in English to compensate for the paucity of annotated data. Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection both in multilingual data and in languages where training samples are scarce.
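To make the two alignment levels concrete, the following is a minimal sketch in Python, not the implementation used in this work. It assumes a hypothetical alias table (`ENTITY_ALIASES`) that maps entity mentions in any language to one canonical ID, and a toy lookup (`CROSS_LINGUAL_VECTORS`) of word vectors that are assumed to already lie in the English embedding space. All names, data, and the choice of mean pooling are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table: surface forms in any language -> canonical entity ID.
ENTITY_ALIASES = {
    "paris": "Q_PARIS",
    "parís": "Q_PARIS",
    "notre-dame": "Q_NOTRE_DAME",
    "notre dame": "Q_NOTRE_DAME",
}

# Toy cross-lingual word vectors, assumed to be pre-aligned to the English space.
CROSS_LINGUAL_VECTORS = {
    "fire": [1.0, 0.0],
    "incendio": [0.9, 0.1],
    "cathedral": [0.0, 1.0],
    "catedral": [0.1, 0.9],
}


def canonical_entities(mentions):
    """Node-level alignment: map each mention to its canonical entity ID.

    Mentions missing from the alias table fall back to their lowercased form.
    """
    return {ENTITY_ALIASES.get(m.lower(), m.lower()) for m in mentions}


def build_message_graph(messages):
    """Link every pair of messages that share at least one canonical entity.

    `messages` maps a message ID to a dict with an "entities" list.
    Returns a set of undirected edges as sorted (id_a, id_b) tuples.
    """
    by_entity = defaultdict(set)
    for msg_id, msg in messages.items():
        for entity in canonical_entities(msg["entities"]):
            by_entity[entity].add(msg_id)

    edges = set()
    for msg_ids in by_entity.values():
        for a, b in combinations(sorted(msg_ids), 2):
            edges.add((a, b))
    return edges


def message_vector(tokens):
    """Semantic-level alignment: average the aligned vectors of known tokens.

    Returns None when no token has a vector.
    """
    vectors = [CROSS_LINGUAL_VECTORS[t] for t in tokens if t in CROSS_LINGUAL_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

In this sketch, an English and a Spanish message that mention "Paris" and "París" end up connected by an edge, and their averaged vectors are directly comparable because both are expressed in the same space. The edges and vectors would then serve as the input graph for a GNN encoder.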