Optimal transport distances have become a classic tool to compare probability distributions and have found many applications in machine learning. Yet, despite recent algorithmic developments, their computational complexity prevents their direct use on large-scale datasets. To overcome this challenge, a common workaround is to compute these distances on minibatches, i.e., to average the outcome of several smaller optimal transport problems. We propose in this paper an extended analysis of this practice, whose effects were previously studied only in restricted cases. We first consider a large variety of Optimal Transport kernels. We notably argue that the minibatch strategy comes with appealing properties, such as unbiased estimators and gradients and a concentration bound around the expectation, but also with limits: minibatch OT is not a distance. To recover some of the lost distance axioms, we introduce a debiased minibatch OT function and study its statistical and optimisation properties. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, generative adversarial networks (GANs), and color transfer that highlight the practical interest of this strategy.
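As a rough illustration of the minibatch strategy described above, the following Python sketch averages exact OT costs over randomly drawn pairs of minibatches. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`ot_cost`, `minibatch_ot`, `debiased_minibatch_ot`), the squared Euclidean ground cost, uniform weights, sampling without replacement, and the use of SciPy's assignment solver are all choices made here for illustration. The debiased form, which subtracts half of each self-comparison term, is likewise an assumed form chosen by analogy with other debiased OT losses; the abstract does not state the formula.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def ot_cost(x, y):
    """Exact OT cost between two equal-size point clouds with uniform weights.

    With equal sizes and uniform weights, an optimal plan can be taken to be
    a permutation, so the assignment solver returns the OT value.
    Assumption: squared Euclidean ground cost.
    """
    cost = cdist(x, y, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()


def minibatch_ot(x, y, batch_size, n_batches, rng):
    """Average the OT cost over n_batches random pairs of minibatches."""
    total = 0.0
    for _ in range(n_batches):
        # Assumption: each minibatch is drawn uniformly without replacement.
        xb = x[rng.choice(len(x), size=batch_size, replace=False)]
        yb = y[rng.choice(len(y), size=batch_size, replace=False)]
        total += ot_cost(xb, yb)
    return total / n_batches


def debiased_minibatch_ot(x, y, batch_size, n_batches, rng):
    """Assumed debiased form: cross term minus half of each self term."""
    cross = minibatch_ot(x, y, batch_size, n_batches, rng)
    self_x = minibatch_ot(x, x, batch_size, n_batches, rng)
    self_y = minibatch_ot(y, y, batch_size, n_batches, rng)
    return cross - 0.5 * (self_x + self_y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = rng.normal(loc=3.0, size=(200, 2))
    print("minibatch OT(x, y):", minibatch_ot(x, y, 16, 50, rng))
    print("minibatch OT(x, x):", minibatch_ot(x, x, 16, 50, rng))
    print("debiased  OT(x, y):", debiased_minibatch_ot(x, y, 16, 50, rng))
```

Comparing a sample with itself shows the limitation named in the abstract: two independent minibatches drawn from the same data rarely coincide, so the averaged cost of `x` against `x` is strictly positive, whereas the full OT cost of `x` against itself is zero. The subtraction in `debiased_minibatch_ot` is meant to compensate for that offset.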