Regularized optimal transport (OT) is now increasingly used as a loss or as a matching layer in neural networks. Entropy-regularized OT can be computed using the Sinkhorn algorithm but it leads to fully-dense transportation plans, meaning that all sources are (fractionally) matched with all targets. To address this issue, several works have investigated quadratic regularization instead. This regularization preserves sparsity and leads to unconstrained and smooth (semi) dual objectives, that can be solved with off-the-shelf gradient methods. Unfortunately, quadratic regularization does not give direct control over the cardinality (number of nonzeros) of the transportation plan. We propose in this paper a new approach for OT with explicit cardinality constraints on the transportation plan. Our work is motivated by an application to sparse mixture of experts, where OT can be used to match input tokens such as image patches with expert models such as neural networks. Cardinality constraints ensure that at most $k$ tokens are matched with an expert, which is crucial for computational performance reasons. Despite the nonconvexity of cardinality constraints, we show that the corresponding (semi) dual problems are tractable and can be solved with first-order gradient methods. Our method can be thought as a middle ground between unregularized OT (recovered in the limit case $k=1$) and quadratically-regularized OT (recovered when $k$ is large enough). The smoothness of the objectives increases as $k$ increases, giving rise to a trade-off between convergence speed and sparsity of the optimal plan.
翻译:正则化的最优输运已经越来越多地被用作神经网络中的损失函数或匹配层。熵正则化的最优输运可以使用Sinkhorn算法计算,但它导致完全密度的输运计划,这意味着所有源都(分数)与所有目标匹配。为了解决这个问题,几个研究已经调查了二次正则化。这种正则化保留稀疏性并导致无约束和平滑的(半)对偶目标,可以使用现成的梯度方法求解。不幸的是,二次正则化不能直接控制输运计划的基数(非零数)。本文提出了一种新的输运方法,以明确输运计划上的基数约束。我们的工作受到了稀疏专家混合应用的启发,其中OT可以用来将输入令牌(如图像块)与专家模型(如神经网络)匹配。基数约束确保最多与专家匹配$k$个令牌,这对于计算性能来说非常重要。尽管基数约束具有非凸性,但我们展示了相应的(半)对偶问题是可行的,可以使用一阶梯度方法解决。我们的方法可以被认为是未正则化的OT(在极限情况下$k=1$恢复)和二次正则化的OT(当$k$足够大时恢复)之间的中间地带。随着$k$的增加,目标的平滑性增加,从而产生收敛速度和最优计划的稀疏性之间的折衷。