We consider a general task called partial Wasserstein covering, whose goal is to emulate a large dataset (e.g., an application dataset) with a small dataset (e.g., a development dataset) in terms of the empirical distribution, by selecting a small subset from a candidate dataset and adding it to the small dataset. We model this task as a discrete optimization problem with the partial Wasserstein divergence as the objective function. Although this problem is NP-hard, we prove that it has the submodular property, which allows us to use a greedy algorithm with a $(1-1/e)\approx 0.63$ approximation guarantee. However, the greedy algorithm remains inefficient because it requires solving a linear program for every evaluation of the objective function. To overcome this difficulty, we propose quasi-greedy algorithms for acceleration, built on a series of techniques including sensitivity analysis based on strong duality and the so-called $C$-transform from optimal transport. Experimentally, we demonstrate that our method can efficiently make two datasets similar in terms of partial Wasserstein divergence, including on driving-scene datasets.
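To make the setup concrete, the sketch below implements the two ingredients named above under simplified assumptions: the partial Wasserstein divergence between two 1-D point sets, computed by solving the transport linear program directly with `scipy.optimize.linprog`, and a plain greedy loop that repeatedly adds the candidate point minimizing that divergence. All function names, the 1-D ground cost, and the transported-mass fraction `m` are illustrative choices, not the paper's exact formulation, and the quasi-greedy accelerations are deliberately omitted.

```python
import numpy as np
from scipy.optimize import linprog

def partial_wasserstein(X, Y, m):
    """Partial Wasserstein divergence between uniform empirical
    distributions on 1-D point sets X and Y, transporting total mass m.
    LP: min <C, P>  s.t.  P 1 <= a,  P^T 1 <= b,  sum(P) = m,  P >= 0."""
    nx, ny = len(X), len(Y)
    a = np.full(nx, 1.0 / nx)              # uniform source weights
    b = np.full(ny, 1.0 / ny)              # uniform target weights
    C = np.abs(np.asarray(X)[:, None] - np.asarray(Y)[None, :])
    rows = []
    for i in range(nx):                    # row-sum constraints P 1 <= a
        r = np.zeros(nx * ny); r[i * ny:(i + 1) * ny] = 1; rows.append(r)
    for j in range(ny):                    # column-sum constraints P^T 1 <= b
        c = np.zeros(nx * ny); c[j::ny] = 1; rows.append(c)
    res = linprog(C.ravel(), A_ub=np.array(rows),
                  b_ub=np.concatenate([a, b]),
                  A_eq=np.ones((1, nx * ny)), b_eq=[m],
                  bounds=(0, None))
    return res.fun

def greedy_cover(dev, app, candidates, k, m=0.9):
    """Greedily add k candidate points to dev so as to minimize the
    partial Wasserstein divergence to app. Each step re-solves the LP
    from scratch; the paper's quasi-greedy speedups are not shown."""
    selected, pool = list(dev), list(candidates)
    for _ in range(k):
        best = min(pool, key=lambda c: partial_wasserstein(
            selected + [c], app, m))
        selected.append(best)
        pool.remove(best)
    return selected
```

For example, with a development set `[0.0]`, an application set `[0.0, 5.0]`, and candidates `[1.0, 5.0]`, the greedy step picks `5.0`, since adding it drives the divergence to the application set to zero.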