Rapid and accurate detection of bot accounts is a long-standing problem in online social networks, where bots intrude on and harass genuine users. We propose BotTriNet, a unified embedding framework that detects bots from the textual content accounts post, on the assumption that this content naturally reveals an account's personality and habits. Content is abundant, and it is valuable provided the system can efficiently extract bot-related information from it using embedding techniques. Beyond the general embedding framework that generates word, sentence, and account embeddings, we design a triplet network that tunes the raw embeddings (produced by traditional natural language processing techniques) for better classification performance. We evaluate detection accuracy and F1 score on the real-world dataset CRESCI2017, which comprises three bot account categories and five bot sample sets. Our system achieves the highest average accuracy (98.34%) and F1 score (97.99%) on two content-intensive bot sets, outperforming previous work and setting a new state of the art. It also makes a breakthrough on four content-less bot sets, improving average accuracy by 11.52% and average F1 score by 16.70%.
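To make the idea of tuning embeddings with a triplet network concrete, the following is a minimal sketch of the triplet objective on toy account embeddings. It is not the BotTriNet implementation: the mean pooling of sentence vectors into an account vector, the Euclidean distance, the margin of 0.5, and all function and variable names are assumptions chosen for illustration.

```python
import math

def mean_pool(vectors):
    """Average equal-length vectors into one vector (assumed pooling step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D sentence embeddings (made-up numbers), pooled per account.
bot_a = mean_pool([[1.0, 0.0], [0.8, 0.2]])
bot_b = mean_pool([[0.9, 0.1], [0.7, 0.3]])
human = mean_pool([[0.0, 1.0], [0.2, 0.8]])

# Same-class pair close, other class far: the triplet is already satisfied.
well_separated = triplet_loss(bot_a, bot_b, human)

# Swapping positive and negative violates the margin and yields a positive loss.
violating = triplet_loss(bot_a, human, bot_b)

print(well_separated, violating)
```

In a trained triplet network, a positive loss such as the second case would produce gradients that pull same-class account embeddings together and push different-class embeddings apart, which is the sense in which the raw embeddings are tuned before classification.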