Online social media has recently become a primary source of new information, and also of misinformation and rumours. In the absence of an automatic rumour detection system, the propagation of rumours has increased manifold, causing serious societal damage. In this work, we propose a novel method for building an automatic rumour detection system that focuses on oversampling to alleviate the fundamental challenge of class imbalance in the rumour detection task. Our oversampling method relies on contextualised data augmentation to generate synthetic samples for underrepresented classes in the dataset. The key idea is to select which tweets in a thread to augment: a non-random selection criterion focuses the augmentation process on relevant tweets. Furthermore, we propose two graph neural networks (GNNs) to model non-linear conversations in a thread. To enhance the tweet representations in our method, we employ a custom feature selection technique based on the state-of-the-art BERTweet model. Experiments on three publicly available datasets confirm that 1) our GNN models outperform the current state-of-the-art classifiers by more than 20% (F1-score); 2) our oversampling technique increases model performance by more than 9% (F1-score); 3) focusing on relevant tweets for data augmentation via a non-random selection criterion can further improve the results; and 4) our method has superior capabilities for detecting rumours at a very early stage.
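To make the oversampling idea concrete, the following is a minimal sketch, not the implementation evaluated in this work. It balances classes by creating synthetic threads in which only tweets chosen by a non-random criterion are augmented. All names here (`select_tweets`, `oversample`, `word_count`, `swap_words`, `SWAPS`) are hypothetical. The relevance score is a plain word count, and the augmentation is a fixed word substitution standing in for a contextualised language model; both are assumptions made only for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

Thread = List[str]  # a thread is a list of tweets, source tweet first


def select_tweets(thread: Thread, score: Callable[[str], float], k: int) -> List[int]:
    """Non-random selection: indices of the k highest-scoring tweets."""
    ranked = sorted(range(len(thread)), key=lambda i: score(thread[i]), reverse=True)
    return sorted(ranked[:k])


def oversample(
    threads: List[Thread],
    labels: List[str],
    score: Callable[[str], float],
    augment: Callable[[str], str],
    k: int = 1,
) -> Tuple[List[Thread], List[str]]:
    """Add synthetic threads until every class matches the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    out_threads, out_labels = list(threads), list(labels)
    for label, n in counts.items():
        pool = [t for t, y in zip(threads, labels) if y == label]
        for j in range(target - n):
            base = pool[j % len(pool)]  # cycle through the minority threads
            chosen = set(select_tweets(base, score, k))
            # Augment only the selected tweets; copy the rest unchanged.
            synthetic = [augment(tw) if i in chosen else tw for i, tw in enumerate(base)]
            out_threads.append(synthetic)
            out_labels.append(label)
    return out_threads, out_labels


# Placeholder components (assumptions, not the paper's choices).
SWAPS = {"reported": "claimed"}


def word_count(tweet: str) -> float:
    return float(len(tweet.split()))


def swap_words(tweet: str) -> str:
    return " ".join(SWAPS.get(w, w) for w in tweet.split())


threads = [
    ["fire reported downtown", "really?"],
    ["nice weather today", "yes"],
    ["big game tonight", "go team"],
]
labels = ["rumour", "non-rumour", "non-rumour"]
new_threads, new_labels = oversample(threads, labels, word_count, swap_words, k=1)
print(Counter(new_labels))
print(new_threads[-1])
```

Because the placeholder augmentation is deterministic, cycling through a small minority pool would eventually repeat synthetic threads; a contextual model that samples replacements would avoid this.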