Spam is a serious problem for web-scale digital platforms that facilitate user content creation and distribution. It compromises a platform's integrity, the performance of services such as recommendation and search, and the overall business. Spammers engage in a variety of abusive and evasive behaviors that are distinct from those of non-spammers. Users' complex behavior can be well represented by a heterogeneous graph rich with node and edge attributes. Learning to identify spammers in such a graph for a web-scale platform is challenging because of the graph's structural complexity and size. In this paper, we propose SEINE (Spam DEtection using Interaction NEtworks), a spam detection model over a novel graph framework. Our graph simultaneously captures rich user details and behavior and enables learning on a billion-scale graph. Our model considers a node's neighborhood along with edge types and attributes, allowing it to capture a wide range of spammers. SEINE, trained on a real dataset of tens of millions of nodes and billions of edges, achieves 80% recall at a 1% false positive rate. On a public dataset, SEINE achieves performance comparable to state-of-the-art techniques while remaining practical to deploy in a large-scale production system.
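The operating point quoted above, recall at a fixed false positive rate, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `recall_at_fpr`, the label convention (1 = spammer, 0 = non-spammer), and the toy scores are all assumptions made for illustration. The idea is to lower the decision threshold until the share of non-spammers flagged would exceed the budget, and report the fraction of spammers caught at the last admissible threshold.

```python
from typing import Sequence


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Best recall achievable by a score threshold whose FPR stays <= max_fpr.

    Assumed convention (not from the paper): label 1 = spammer, 0 = non-spammer,
    and a higher score means "more likely spam".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one spammer and one non-spammer")

    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items tied at this score are flagged together, so consume them as a group.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / negatives > max_fpr:
            break  # lowering the threshold further can only add false positives
        best_recall = tp / positives
    return best_recall


# Toy data: 5 spammers and 100 non-spammers (hypothetical scores).
spam_scores = [0.99, 0.95, 0.90, 0.85, 0.05]
ham_scores = [0.88] + [0.10] * 99
scores = spam_scores + ham_scores
labels = [1] * len(spam_scores) + [0] * len(ham_scores)

print(recall_at_fpr(scores, labels, max_fpr=0.01))
```

On this toy data a 1% budget allows exactly one of the 100 non-spammers to be flagged, so the threshold can drop to 0.85 and catch four of the five spammers. On a platform with tens of millions of users, the same 1% budget still corresponds to a large absolute number of flagged legitimate accounts, which is why the false positive rate is fixed first and recall is reported at that point.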