Big data analytics refers to the analysis of data sets of very large scale. Big data is commonly characterized by five V's: Volume (large scale of data), Velocity (high speed of generation and processing), Variety (many data types), Value (usefulness of the data), and Veracity (trustworthiness of the data).

A key challenge in big data analytics is how to collect large amounts of (labeled) data. Crowdsourcing aims to address this challenge by aggregating and estimating high-quality data (e.g., sentiment labels for text) from a broad base of clients/users. Existing crowdsourcing research focuses on designing new methods to improve the quality of data aggregated from unreliable/noisy clients. However, the security aspects of such crowdsourcing systems remain largely unexplored to date. In this work, we aim to bridge this gap. Specifically, we show that crowdsourcing is vulnerable to data poisoning attacks, in which malicious clients supply carefully crafted data to corrupt the aggregated data. We formulate our proposed data poisoning attack as an optimization problem that maximizes the error of the aggregated data. Our evaluation on one synthetic and two real-world benchmark datasets shows that the proposed attack can substantially increase the estimation error of the aggregated data. We also propose two defenses to reduce the impact of malicious clients. Our empirical results show that the proposed defenses can substantially reduce the estimation error under data poisoning attacks.
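As a toy illustration of the attack described above, the sketch below uses a plain averaging aggregator standing in for the aggregation methods the paper actually targets; the client counts, the [1, 5] report range, and all names are hypothetical, not taken from the paper:

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical setup: 20 honest clients report a noisy estimate of a
# true quantity (e.g., an average sentiment score on a 1-5 scale).
TRUE_VALUE = 3.0
honest = [TRUE_VALUE + random.gauss(0.0, 0.5) for _ in range(20)]

# Plain averaging stands in for the paper's aggregation rules;
# it is used here only for illustration.
def aggregate(reports):
    return mean(reports)

# The attack as an optimization problem: m malicious clients pick the
# report in the allowed range [lo, hi] that maximizes the error of the
# aggregate. For a mean aggregator the maximizer is one of the two
# boundary points, so the search reduces to a two-way comparison.
def craft_poison(honest_reports, m, lo, hi):
    return max(
        (lo, hi),
        key=lambda v: abs(aggregate(honest_reports + [v] * m) - TRUE_VALUE),
    )

clean_err = abs(aggregate(honest) - TRUE_VALUE)
v = craft_poison(honest, m=5, lo=1.0, hi=5.0)
poisoned_err = abs(aggregate(honest + [v] * 5) - TRUE_VALUE)
```

With only 5 malicious clients out of 25, the crafted reports noticeably inflate the aggregate's error. A robust aggregator (e.g., a trimmed mean) is one generic way to limit such damage, though the two defenses proposed in the paper may differ.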

https://www.zhuanzhi.ai/paper/d25992f7a7df3ee1468f244f05a8ba03


Latest Papers

This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each subsample with a subsample size $k_N\ll N$ to meet the memory constraint is sampled uniformly without replacement. Aggregating the estimators of $m_N$ subsamples can lead to subbagging estimation. To analyze the theoretical properties of the subbagging estimator, we adapt the incomplete $U$-statistics theory with an infinite order kernel to allow overlapping drawn subsamples in the sampling procedure. Utilizing this novel theoretical framework, we demonstrate that via a proper hyperparameter selection of $k_N$ and $m_N$, the subbagging estimator can achieve $\sqrt{N}$-consistency and asymptotic normality under the condition $(k_Nm_N)/N\to \alpha \in (0,\infty]$. Compared to the full sample estimator, we theoretically show that the $\sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/\alpha$ in its asymptotic variance. Simulation experiments are presented to demonstrate the finite sample performances. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
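The subbagging procedure above can be sketched as follows, assuming for simplicity that the estimator of interest is the sample mean (the paper covers general estimators; all names and constants here are illustrative, not the paper's):

```python
import random
from statistics import mean

random.seed(1)

# Toy "full dataset": N observations with population mean 2.0. In a real
# memory-constrained setting the data would live on disk, not in a list.
N = 100_000
data = [random.gauss(2.0, 1.0) for _ in range(N)]

# Subbagging (subsample aggregating) for the sample mean: draw m_N
# subsamples, each of size k_N << N sampled uniformly without
# replacement (the subsamples may overlap with one another), compute
# the estimator on each, then average the m_N subsample estimators.
def subbagging_estimate(data, k_N, m_N):
    return mean(mean(random.sample(data, k_N)) for _ in range(m_N))

k_N, m_N = 1_000, 200            # here (k_N * m_N) / N = 2, i.e. alpha = 2
full_sample = mean(data)         # requires all N points at once
subbagged = subbagging_estimate(data, k_N, m_N)
```

The subbagging estimate tracks the full-sample estimate closely (the abstract reports the same behavior on the airline data), while each subsample-level computation touches only k_N points. Per the abstract, with (k_N m_N)/N → α the subbagging estimator's asymptotic variance is inflated at rate 1/α relative to the full-sample estimator.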
