Rapid and accurate detection of bot accounts is a long-standing problem in online social networks, where bots intrude on and harass genuine users. We propose BotTriNet, a unified embedding framework that detects bots from the textual content that accounts post, on the assumption that context naturally reveals an account's personality and habits. Content is abundant, and it becomes valuable when the system efficiently extracts bot-related information from it using embedding techniques. Beyond the general embedding framework, which generates word, sentence, and account embeddings, we design a triplet network that tunes the raw embeddings (produced by traditional natural language processing techniques) for better classification performance. We evaluate detection accuracy and F1 score on the real-world dataset CRESCI2017, which comprises three bot account categories and five bot sample sets. Our system achieves the highest average accuracy of 98.34% and F1 score of 97.99% on two content-intensive bot sets, outperforming previous work and setting a new state of the art. It also improves substantially on four content-less bot sets, raising average accuracy by 11.52% and average F1 score by 16.70%.
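The pipeline described in the abstract (word embeddings pooled into sentence embeddings, sentence embeddings pooled into account embeddings, then a triplet objective that pulls same-class accounts together) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say how embeddings are pooled, so mean pooling is used; the two-dimensional word vectors and the margin value are made up; and only the triplet loss is shown, not the network that BotTriNet trains with it.

```python
from math import sqrt

def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed pooling choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)

# Hypothetical 2-d word embeddings, invented for this example only.
WORD_VECS = {
    "free": [1.0, 0.0], "followers": [0.8, 0.2], "click": [0.9, 0.1],
    "lunch": [0.0, 1.0], "today": [0.1, 0.9],
}

def sentence_embedding(tokens):
    """Word embeddings -> one sentence embedding."""
    return mean_pool([WORD_VECS[t] for t in tokens])

def account_embedding(posts):
    """Sentence embeddings of all posts -> one account embedding."""
    return mean_pool([sentence_embedding(p) for p in posts])

bot_a = account_embedding([["free", "followers"], ["click", "free"]])
bot_b = account_embedding([["click", "followers"]])
human = account_embedding([["lunch", "today"]])

# Anchor and positive share a class (bot); the negative is a genuine user.
loss_separated = triplet_loss(bot_a, bot_b, human, margin=0.5)
# Swapping positive and negative gives a triplet that violates the margin.
loss_violated = triplet_loss(bot_a, human, bot_b, margin=0.5)

print(loss_separated, loss_violated)
```

In this toy setup the first triplet is already separated by more than the margin, so its loss is zero, while the swapped triplet yields a positive loss. A trained triplet network would adjust the embeddings to reduce losses of the second kind.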