Recently published graph neural networks (GNNs) show promising performance on social event detection tasks. However, most studies target monolingual data in languages with abundant training samples, leaving the more common multilingual settings and lesser-spoken languages relatively unexplored. We therefore present a GNN that incorporates cross-lingual word embeddings to detect events in multilingual data streams. The first step is to make the GNN work with multilingual data. To this end, we outline a graph construction strategy that aligns messages in different languages at both the node level and the semantic level. At the node level, relationships between messages are established by merging entities that are the same but are referred to in different languages. At the semantic level, non-English message representations are converted into the English semantic space via the cross-lingual word embeddings. The resulting message graph is then uniformly encoded by a GNN model. For the special case where events must be detected in a lesser-spoken language, a novel cross-lingual knowledge distillation framework, called CLKD, exploits prior knowledge learned from similar threads in English to compensate for the paucity of annotated data. Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection both in multilingual data and in languages where training samples are scarce.
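The two alignment levels described above can be illustrated with a minimal sketch. It is not the paper's implementation: the alias table `ENTITY_ALIASES`, the message dictionary layout, the helper names, and the use of a plain linear map `W` to move non-English embeddings into the English space are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical alias table: surface form in any language -> canonical entity id.
ENTITY_ALIASES = {
    "paris": "ENT_PARIS",
    "巴黎": "ENT_PARIS",
    "berlin": "ENT_BERLIN",
}


def canonical_entities(mentions):
    """Map raw mentions to canonical ids; unknown mentions are kept lowercased."""
    return {ENTITY_ALIASES.get(m.lower(), m.lower()) for m in mentions}


def build_message_edges(messages):
    """Node-level alignment: link two messages that share a canonical entity."""
    ents = {mid: canonical_entities(m["entities"]) for mid, m in messages.items()}
    edges = set()
    for a, b in combinations(sorted(ents), 2):
        if ents[a] & ents[b]:
            edges.add((a, b))
    return edges


def to_english_space(vec, W):
    """Semantic-level alignment: apply an assumed linear map W (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]


def message_features(messages, maps):
    """Return one feature vector per message, all expressed in the English space."""
    feats = {}
    for mid, m in messages.items():
        W = maps.get(m["lang"])  # English messages have no map and pass through
        feats[mid] = m["embedding"] if W is None else to_english_space(m["embedding"], W)
    return feats


# Toy data: two messages about the same entity in different languages, one unrelated.
messages = {
    "m1": {"lang": "en", "entities": ["Paris"], "embedding": [1.0, 0.0]},
    "m2": {"lang": "zh", "entities": ["巴黎"], "embedding": [0.0, 1.0]},
    "m3": {"lang": "en", "entities": ["Berlin"], "embedding": [0.5, 0.5]},
}
maps = {"zh": [[0.0, 1.0], [1.0, 0.0]]}  # made-up zh -> en map (swaps the two dims)

edges = build_message_edges(messages)
feats = message_features(messages, maps)
```

In this toy setup, `m1` and `m2` become neighbors because their mentions resolve to the same canonical entity, and the features of `m2` are projected before any encoder sees them. The resulting `edges` and `feats` are the kind of input a single GNN could then encode uniformly.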