Multilingual machine translation suffers from negative interference across languages. A common remedy is to relax parameter sharing with language-specific modules such as adapters. However, adapters of related languages cannot transfer information to each other, and their total parameter count grows prohibitively as the number of languages increases. In this work, we overcome these drawbacks using hyper-adapters -- hyper-networks that generate adapters from language and layer embeddings. While past work reported poor results when scaling hyper-networks, we propose a rescaling fix that significantly improves convergence and enables training larger hyper-networks. We find that hyper-adapters are more parameter-efficient than regular adapters, reaching the same performance with up to 12 times fewer parameters. When using the same number of parameters and FLOPs, our approach consistently outperforms regular adapters. Hyper-adapters also converge faster than alternative approaches and scale better than regular dense networks. Our analysis shows that hyper-adapters learn to encode language relatedness, enabling positive transfer across languages.
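To make the idea concrete, below is a minimal sketch (in PyTorch) of a hyper-adapter: a single shared hyper-network that maps learned language and layer embeddings to the weights of a bottleneck adapter. All module names, dimensions, and the specific fan-in-style rescaling constant are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class HyperAdapter(nn.Module):
    """Sketch of a hyper-adapter: a hyper-network generates bottleneck-adapter
    weights from language and layer embeddings. Sizes and the rescaling
    constant are assumptions for illustration only."""

    def __init__(self, num_languages, num_layers, d_model=512,
                 bottleneck=64, d_embed=32, d_hyper=128):
        super().__init__()
        self.d_model, self.bottleneck = d_model, bottleneck
        # Learned language and layer embeddings (inputs to the hyper-network).
        self.lang_embed = nn.Embedding(num_languages, d_embed)
        self.layer_embed = nn.Embedding(num_layers, d_embed)
        # Hyper-network: concatenated embeddings -> shared hidden state ->
        # flattened adapter projection matrices.
        self.hyper = nn.Sequential(nn.Linear(2 * d_embed, d_hyper), nn.ReLU())
        self.to_down = nn.Linear(d_hyper, d_model * bottleneck)
        self.to_up = nn.Linear(d_hyper, bottleneck * d_model)
        # Rescaling of the generated weights (assumption: simple fan-in
        # scaling, standing in for the paper's rescaling fix).
        self.scale = d_model ** -0.5

    def forward(self, hidden, lang_id, layer_id):
        # hidden: (batch, seq, d_model); lang_id / layer_id: scalar LongTensors.
        z = torch.cat([self.lang_embed(lang_id), self.layer_embed(layer_id)], dim=-1)
        h = self.hyper(z)
        w_down = self.to_down(h).view(self.d_model, self.bottleneck) * self.scale
        w_up = self.to_up(h).view(self.bottleneck, self.d_model) * self.scale
        # Standard bottleneck adapter with a residual connection.
        return hidden + torch.relu(hidden @ w_down) @ w_up


# Usage sketch: one shared hyper-network serves every (language, layer) pair,
# so parameters grow with the embedding tables rather than with per-language adapters.
adapter = HyperAdapter(num_languages=10, num_layers=6)
x = torch.randn(2, 7, 512)
y = adapter(x, torch.tensor(0), torch.tensor(3))
print(y.shape)  # torch.Size([2, 7, 512])
```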