An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation

Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. In this paper, we study model-free subsampling methods, which, unlike model-based methods, aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness, namely that the empirical distribution of the selected subsample may not converge to the population distribution. Such computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method based on optimal transport techniques. Moreover, we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show that the selected subsample can be used for efficient density estimation by deriving the convergence rate of the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate the superior performance of the proposed method.
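To make the idea of a subsample kernel density estimator concrete, the following is a minimal one-dimensional sketch in Python. It is not the algorithm proposed in the paper, whose details the abstract does not give. As an illustrative stand-in, the subsample is taken at evenly spaced empirical quantiles of the observed sample, and the bandwidth follows Silverman's rule of thumb rather than the optimal bandwidth derived by the authors. The function names `quantile_subsample_1d`, `gaussian_kde_1d`, and `silverman_bandwidth` are hypothetical and chosen only for this example.

```python
import numpy as np


def quantile_subsample_1d(sample, r):
    """Pick r observed points at evenly spaced empirical quantile levels.

    Assumption: an illustrative 1D stand-in for a "representative"
    subsample, NOT the paper's subsampling algorithm.
    """
    sorted_sample = np.sort(np.asarray(sample, dtype=float))
    n = len(sorted_sample)
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= len(sample)")
    levels = (np.arange(r) + 0.5) / r          # midpoints of r equal-mass bins
    idx = np.floor(levels * n).astype(int)     # order-statistic indices in [0, n-1]
    return sorted_sample[idx]


def silverman_bandwidth(points):
    """Rule-of-thumb bandwidth (assumption: not the paper's optimal bandwidth)."""
    points = np.asarray(points, dtype=float)
    return 1.06 * points.std(ddof=1) * len(points) ** (-1.0 / 5.0)


def gaussian_kde_1d(points, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate built on `points` at `grid`."""
    points = np.asarray(points, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - points[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(points) * bandwidth * np.sqrt(2.0 * np.pi))


# Usage: estimate a standard normal density from a 200-point subsample
# of 10,000 synthetic observations.
rng = np.random.default_rng(0)
full_sample = rng.normal(size=10_000)
subsample = quantile_subsample_1d(full_sample, r=200)
grid = np.linspace(-6.0, 6.0, 1201)
density = gaussian_kde_1d(subsample, grid, silverman_bandwidth(subsample))
```

The estimator is evaluated on only 200 points instead of 10,000, which is the kind of computational saving a representative subsample is meant to deliver. How well the estimate tracks the true density depends on how the subsample and the bandwidth are chosen, which is the question the paper addresses.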
