Multi-task learning with an unbalanced data distribution skews model learning towards high-resource tasks, especially when model capacity is fixed and fully shared across all tasks. Sparse scaling architectures, such as BASELayers, provide flexible mechanisms for different tasks to use a variable number of parameters, which can help counterbalance skewed data distributions. We find that sparse architectures for multilingual machine translation can perform poorly out of the box, and propose two straightforward techniques to mitigate this: a temperature heating mechanism and dense pre-training. Overall, these methods improve performance on two multilingual translation benchmarks compared to standard BASELayers and Dense scaling baselines, and, in combination, more than double model convergence speed.
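To make the "temperature heating" idea concrete, the sketch below assumes it means gradually raising the temperature used to sample languages during training, so early training follows the empirical data distribution and later training upsamples low-resource languages. The function names, the linear schedule, and the endpoint values are hypothetical illustrations, not the paper's exact mechanism or hyperparameters.

```python
import numpy as np

def temperature_sampling_probs(counts, temperature):
    """Language sampling probabilities p_l proportional to (n_l / N) ** (1 / T).

    T = 1 recovers the empirical distribution; larger T flattens it,
    upsampling low-resource languages.
    """
    counts = np.asarray(counts, dtype=np.float64)
    probs = counts / counts.sum()
    scaled = probs ** (1.0 / temperature)
    return scaled / scaled.sum()

def heated_temperature(step, total_steps, t_start=1.0, t_end=5.0):
    """Linearly 'heat' the temperature from t_start to t_end over training.

    Hypothetical schedule for illustration; the paper may use a different
    shape or different start/end temperatures.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Example: three languages with heavily skewed data sizes.
counts = [10_000_000, 500_000, 50_000]
for step in (0, 50_000, 100_000):
    T = heated_temperature(step, total_steps=100_000)
    print(step, T, temperature_sampling_probs(counts, T).round(3))
```

Under this reading, the schedule trades off fidelity to the true task distribution early on against balanced exposure to low-resource tasks later, which is one plausible way to counteract the skew described above.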