Optimal Transport (OT) has recently emerged as a central tool in data science for comparing point clouds and, more generally, probability distributions in a geometrically faithful way. The wide adoption of OT in existing data analysis and machine learning pipelines is, however, hampered by several shortcomings: its lack of robustness to outliers, its high computational cost, the need for a large number of samples in high dimension, and the difficulty of handling data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We emphasize in particular unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data science.
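To make the entropic regularization mentioned in the abstract concrete, here is a minimal sketch of Sinkhorn iterations, the standard fixed-point scheme for entropic OT between two discrete distributions. It is an illustration under stated assumptions, not the review's own implementation: the function name `sinkhorn`, the parameters `eps` and `n_iters`, the squared Euclidean cost, and the random point clouds are all hypothetical choices made for this example.

```python
import numpy as np


def sinkhorn(a, b, C, eps=0.2, n_iters=2000):
    """Approximate the entropic OT plan between histograms a and b.

    a, b : 1-D arrays of positive weights, each summing to 1.
    C    : cost matrix of shape (len(a), len(b)).
    eps  : regularization strength (illustrative default).
    """
    K = np.exp(-C / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale rows toward marginal a
        v = b / (K.T @ u)          # rescale columns toward marginal b
    # Transport plan: diag(u) @ K @ diag(v)
    return u[:, None] * K * v[None, :]


# Two small point clouds in the plane (synthetic data for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(7, 2)) + 1.0

# Squared Euclidean cost, normalized so that exp(-C / eps) stays well scaled.
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
C = C / C.max()

a = np.full(5, 1.0 / 5)            # uniform weights on x
b = np.full(7, 1.0 / 7)            # uniform weights on y

P = sinkhorn(a, b, C)
cost = float((P * C).sum())        # transport cost of the regularized plan
```

Each iteration costs only two matrix-vector products, which is why this scheme scales well on parallel hardware. This sketch covers the balanced case only; for small `eps`, a log-domain implementation would be needed to avoid numerical underflow in the kernel.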