Summarizing data via representative examples is an important problem in several machine learning applications where human understanding of the learned models and the underlying data distribution is essential for decision making. In this work, we develop an optimal transport (OT) based framework to select informative prototypical examples that best represent a given target dataset. We model the prototype selection problem as learning a sparse (empirical) probability distribution that has minimum OT distance from the target distribution. The learned probability measure supported on the chosen prototypes directly corresponds to their importance in representing and summarizing the target data. We show that our objective function enjoys a key property of submodularity, and we propose a parallelizable greedy method that is computationally fast and possesses deterministic approximation guarantees. Empirical results on several real-world benchmarks illustrate the efficacy of our approach.
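The selection scheme described above can be illustrated with a minimal sketch. The code below is a hedged, simplified illustration, not the paper's actual algorithm: it assumes a squared-Euclidean ground cost and uses the fact that, when the weights on the selected prototypes are free to be optimized, the minimum OT cost from the target empirical distribution to any distribution supported on the selected set reduces to the nearest-prototype assignment cost. A greedy loop then picks, at each step, the candidate that most reduces that cost, and the learned weights are recovered as the fraction of target mass assigned to each prototype. The function name `greedy_prototypes` and all parameter names are illustrative assumptions.

```python
import math

def greedy_prototypes(points, k):
    """Greedily select k prototypes from `points` (list of tuples).

    Illustrative sketch only: minimizes the nearest-prototype assignment
    cost, which equals the OT distance from the target empirical
    distribution to the best distribution supported on the selected set
    (under a squared-Euclidean ground cost with free prototype weights).
    Returns (selected indices, learned prototype weights).
    """
    n = len(points)

    def cost(a, b):
        # Squared Euclidean ground cost (an assumption of this sketch).
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    selected = []
    # best_dist[i] = cost from point i to its nearest selected prototype.
    best_dist = [math.inf] * n

    for _ in range(k):
        best_j, best_total = None, math.inf
        for j in range(n):
            if j in selected:
                continue
            # Total assignment cost if candidate j were added.
            total = sum(min(best_dist[i], cost(points[i], points[j]))
                        for i in range(n))
            if total < best_total:
                best_j, best_total = j, total
        selected.append(best_j)
        for i in range(n):
            best_dist[i] = min(best_dist[i], cost(points[i], points[best_j]))

    # Learned sparse distribution: mass of target points assigned to
    # each prototype under the optimal (nearest-neighbor) transport plan.
    weights = [0.0] * len(selected)
    for i in range(n):
        t = min(range(len(selected)),
                key=lambda s: cost(points[i], points[selected[s]]))
        weights[t] += 1.0 / n
    return selected, weights
```

On a toy dataset of two well-separated clusters, the greedy loop picks one medoid-like point per cluster, and the learned weights reflect the share of target mass each prototype summarizes.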