A key challenge of big data analytics is how to collect a large volume of (labeled) data. Crowdsourcing aims to address this challenge by aggregating data from pervasive clients/users and estimating high-quality data (e.g., sentiment labels for text) from it. Existing studies on crowdsourcing focus on designing new methods that improve the quality of data aggregated from unreliable/noisy clients. However, the security of such crowdsourcing systems remains under-explored to date. We aim to bridge this gap in this work. Specifically, we show that crowdsourcing is vulnerable to data poisoning attacks, in which malicious clients provide carefully crafted data to corrupt the aggregated data. We formulate our proposed data poisoning attacks as an optimization problem that maximizes the error of the aggregated data. Our evaluation results on one synthetic and two real-world benchmark datasets demonstrate that the proposed attacks can substantially increase the estimation errors of the aggregated data. We also propose two defenses to reduce the impact of malicious clients. Our empirical results show that the proposed defenses can substantially reduce the estimation errors caused by the data poisoning attacks.
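To make the threat model concrete, the following is a minimal Python sketch of how a few malicious clients can shift an aggregated estimate. It is an illustration under stated assumptions, not the attack or the defenses proposed in this work: the mean aggregator, the bound-pushing heuristic in `craft_poison`, the median used as a stand-in robust aggregator, and all names and parameter values are hypothetical choices made for this example. The abstract does not specify the aggregation method, the optimization procedure, or the two defenses.

```python
import random
import statistics


def simulate_reports(truth, n_honest, noise, rng):
    """Honest clients report the true value of each item plus Gaussian noise."""
    return [[t + rng.gauss(0.0, noise) for t in truth] for _ in range(n_honest)]


def craft_poison(estimate, n_malicious, lo, hi):
    """Hypothetical heuristic attack (NOT the paper's optimization):
    for each item, report the allowed bound farthest from the current estimate."""
    row = [hi if (hi - e) >= (e - lo) else lo for e in estimate]
    return [list(row) for _ in range(n_malicious)]


def aggregate_mean(reports):
    """Assumed baseline aggregator: per-item mean over all clients."""
    return [statistics.fmean(list(col)) for col in zip(*reports)]


def aggregate_median(reports):
    """Stand-in robust aggregator (an assumption, not one of the paper's defenses)."""
    return [statistics.median(list(col)) for col in zip(*reports)]


def mean_abs_error(estimate, truth):
    """Average absolute estimation error over all items."""
    return statistics.fmean([abs(e - t) for e, t in zip(estimate, truth)])


def run_demo(seed=0, n_items=50, n_honest=20, n_malicious=5):
    rng = random.Random(seed)
    lo, hi = 0.0, 10.0
    truth = [rng.uniform(lo, hi) for _ in range(n_items)]
    honest = simulate_reports(truth, n_honest, noise=1.0, rng=rng)

    clean_estimate = aggregate_mean(honest)
    poison = craft_poison(clean_estimate, n_malicious, lo, hi)
    all_reports = honest + poison

    err_clean = mean_abs_error(clean_estimate, truth)
    err_attacked = mean_abs_error(aggregate_mean(all_reports), truth)
    err_defended = mean_abs_error(aggregate_median(all_reports), truth)
    return err_clean, err_attacked, err_defended


if __name__ == "__main__":
    clean, attacked, defended = run_demo()
    print(f"clean={clean:.3f} attacked={attacked:.3f} defended={defended:.3f}")
```

In this toy setting the attacker is assumed to know the clean estimate and the valid value range, which is a strong assumption; the sketch is only meant to show why an unweighted aggregate is sensitive to a small fraction of crafted reports and why a robust aggregator limits the damage.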