The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate into neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than existing ones and allows us, for example, to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violators (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm for solving isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixtures of experts.
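To make the construction concrete, the sketch below works out the quadratic case (p = 2), where the regularized top-k mask is the Euclidean projection of the temperature-scaled input onto the permutahedron generated by w = (1, ..., 1, 0, ..., 0) with k ones, and this projection is known to reduce to isotonic regression solvable by PAV. This is a minimal NumPy sketch under those assumptions, not the authors' implementation; the function name `soft_topk_mask`, the `eps` temperature, and the use of scikit-learn's PAV-based `isotonic_regression` are illustrative choices.

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def soft_topk_mask(x, k, eps=1.0):
    """Relaxed top-k mask via projection onto the permutahedron.

    Assumptions (illustrative, p = 2 case): the operator is the Euclidean
    projection of x / eps onto the permutahedron generated by
    w = (1, ..., 1, 0, ..., 0) with k ones.  The projection reduces to
    isotonic regression, solved here with scikit-learn's PAV routine.
    """
    n = x.shape[0]
    w = np.zeros(n)
    w[:k] = 1.0                      # vertex pattern: k ones, n - k zeros
    z = x / eps                      # temperature scaling
    sigma = np.argsort(-z)           # permutation sorting z in decreasing order
    s = z[sigma]
    # Nonincreasing isotonic regression of s - w (PAV under the hood).
    v = isotonic_regression(s - w, increasing=False)
    y_sorted = s - v                 # projection, in sorted coordinates
    y = np.empty(n)
    y[sigma] = y_sorted              # undo the sorting permutation
    return y                         # coordinates in [0, 1], summing to k
```

As eps grows the output approaches the uniform vector (k/n, ..., k/n); as eps shrinks it recovers the hard top-k indicator; for intermediate values many coordinates are exactly 0 or 1, which is the sparsity property the abstract refers to.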
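The GPU/TPU-friendly Dykstra solver mentioned in the abstract can be illustrated by one standard decomposition: split the chain constraints v1 >= v2 >= ... >= vn into two sets of disjoint pairwise constraints, so each projection step becomes a batch of independent two-variable averages that vectorizes well on accelerators. The sketch below is a minimal NumPy rendering of that idea, not the paper's actual implementation; `isotonic_dykstra`, `_project_pairs`, and `num_iters` are hypothetical names and parameters.

```python
import numpy as np

def _project_pairs(v, start):
    """Exact projection onto the disjoint pair constraints
    v[i] >= v[i + 1] for i = start, start + 2, ...:
    any violating pair is replaced by its average."""
    v = v.copy()
    i = np.arange(start, len(v) - 1, 2)
    avg = 0.5 * (v[i] + v[i + 1])
    bad = v[i] < v[i + 1]
    v[i] = np.where(bad, avg, v[i])
    v[i + 1] = np.where(bad, avg, v[i + 1])
    return v

def isotonic_dykstra(u, num_iters=200):
    """Isotonic regression (projection onto {v : v1 >= ... >= vn})
    via Dykstra's alternating projections onto two constraint sets,
    each a batch of independent, vectorizable pair projections."""
    x = np.asarray(u, dtype=float).copy()
    p = np.zeros_like(x)             # Dykstra correction for set A
    q = np.zeros_like(x)             # Dykstra correction for set B
    for _ in range(num_iters):
        y = _project_pairs(x + p, 0) # "even" pair constraints
        p = x + p - y
        x = _project_pairs(y + q, 1) # "odd" pair constraints
        q = y + q - x
    return x
```

Each sweep touches every coordinate through two fully vectorized passes, with no sequential pooling loop; the result can be checked against the PAV-based `isotonic_regression(u, increasing=False)` used in the previous sketch.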