How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers this in principle, but its $O(n!)$ complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV reduces the factorial burden to $O(T \sum_{l} K_{l}) = O(T K_{\max} \log n)$, rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss $O(\eta \log n)$, enjoys sub-Gaussian coalition deviation $\tilde{O}(1/\sqrt{T})$, and incurs at most $k\,\epsilon_{\infty}$ regret for top-$k$ selection. Experiments on four benchmarks (tabular, vision, streaming, and a 45M-sample CTR task) plus the OpenDataVal suite show that HCDV improves accuracy by up to 5 percentage points, cuts valuation time by up to 100×, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.
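To make stage (iii) concrete, the following is a minimal sketch of a permutation-sampling Monte-Carlo Shapley estimate in which the players are clusters rather than individual points. It is not the HCDV implementation: the function `mc_shapley`, the toy utility `toy_utility`, and the weights in `CLUSTER_QUALITY` are hypothetical names and values introduced purely for illustration, and reading $T$ as the number of sampled permutations is an assumption. In practice the utility would be the validation performance of a model trained on the union of the selected clusters.

```python
import math
import numpy as np

# Hypothetical per-cluster "quality" scores, used only by the toy utility below.
CLUSTER_QUALITY = [4.0, 1.0, 0.0, 2.5]


def toy_utility(members):
    """Stand-in for 'train on these clusters and measure validation performance'.

    Concave in the total quality of the coalition, so marginal gains shrink
    as more clusters are added. An empty coalition has utility 0.
    """
    return math.sqrt(sum(CLUSTER_QUALITY[i] for i in members))


def mc_shapley(utility, n_players, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings.

    Each sampled permutation costs n_players utility calls (plus one for the
    empty coalition), so one local game with K clusters and T permutations
    needs on the order of T * K evaluations rather than K! orderings.
    """
    rng = np.random.default_rng(seed)
    values = np.zeros(n_players)
    for _ in range(n_permutations):
        order = rng.permutation(n_players)
        members = []
        previous = utility(members)
        for player in order:
            members.append(int(player))
            current = utility(members)
            values[player] += current - previous  # marginal contribution
            previous = current
    return values / n_permutations


values = mc_shapley(toy_utility, n_players=len(CLUSTER_QUALITY))
print("estimated cluster values:", values)
print("sum of values:", values.sum(), "grand-coalition utility:", toy_utility([0, 1, 2, 3]))
```

Because the marginal gains along each permutation telescope, the estimates sum exactly to the utility of the full coalition, which is the efficiency axiom at a single level. A hierarchical scheme would run one such game per level and hand each cluster's payoff to its children as their budget; the rule for that hand-off is not specified here and is left out of the sketch.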