Training large-scale mixture-of-experts models efficiently on modern hardware requires assigning datapoints in a batch to different experts, each with a limited capacity. Recently proposed assignment procedures lack a probabilistic interpretation and use biased estimators for training. As an alternative, we propose two unbiased estimators based on principled stochastic assignment procedures: one that skips datapoints which exceed expert capacity, and one that samples perfectly balanced assignments using an extension of the Gumbel-Matching distribution [29]. Both estimators are unbiased, as they correct for the sampling procedure used. In a toy experiment, we find that the `skip' estimator is more effective than the balanced-sampling one, and that both are more robust than biased alternatives in solving the task.
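To make the two assignment procedures concrete, here is a minimal NumPy/SciPy sketch, not the paper's implementation: `skip_assignment` routes each datapoint to its Gumbel-perturbed argmax expert and drops overflow, while `balanced_assignment` draws one sample from a Gumbel-Matching-style distribution by solving a maximum-weight matching on Gumbel-perturbed scores. The function names, the top-1 routing choice, and the expert-slot construction are illustrative assumptions, and the inverse-probability reweighting that makes the resulting gradient estimators unbiased is omitted.

```python
# Minimal sketch of the two stochastic assignment procedures.
# Assumes Gumbel top-1 routing scores; names are illustrative, and the
# reweighting step that yields unbiased gradients is left out.
import numpy as np
from scipy.optimize import linear_sum_assignment

def skip_assignment(logits, capacity, rng):
    """Route each datapoint to its Gumbel-perturbed argmax expert,
    then skip datapoints that arrive after an expert is full."""
    choice = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)
    keep = np.zeros(len(choice), dtype=bool)
    used = np.zeros(logits.shape[1], dtype=int)
    for i, e in enumerate(choice):
        if used[e] < capacity:      # within capacity: keep the datapoint
            keep[i] = True
            used[e] += 1
    return choice, keep             # kept points would be reweighted for unbiasedness

def balanced_assignment(logits, rng):
    """Draw one sample from a Gumbel-Matching-style distribution:
    a maximum-weight matching on Gumbel-perturbed scores, so every
    expert receives exactly n/k datapoints."""
    n, k = logits.shape
    capacity = n // k                             # assumes n divisible by k
    scores = np.repeat(logits, capacity, axis=1)  # one column per expert slot
    scores = scores + rng.gumbel(size=scores.shape)
    _, slot = linear_sum_assignment(scores, maximize=True)
    return slot // capacity                       # map slots back to experts

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))  # 8 datapoints, 4 experts of capacity 2
print(skip_assignment(logits, capacity=2, rng=rng))
print(balanced_assignment(logits, rng))
```

The matching reduction in `balanced_assignment` simply tiles each expert's score column once per capacity slot, so a standard linear assignment solver enforces perfect balance by construction.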