SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios, where read and write policies that diverge across language pairs hinder the learning of a unified strategy. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, without adding inference-time overhead. Our design requires only minimal modifications to standard Transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. In a comprehensive evaluation on six language pairs, our 500M-parameter speech-to-text model outperforms the Seamless baseline, with less than 7% BLEU degradation at an average lag of 1.5 seconds and less than 3% at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency-quality trade-offs.
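To make the notion of a read/write policy concrete, the following is a minimal sketch of the decision loop that any simultaneous translation system runs. It is not the SimulMEGA method: the learned, router-derived policy is replaced here by the well-known fixed wait-k rule as a stand-in, and the names `wait_k_policy` and `simulate`, as well as the abstraction of the source into discrete chunks, are assumptions made for illustration only.

```python
from typing import Callable, List

# A policy maps (source chunks read, target tokens written) to "READ" or "WRITE".
Policy = Callable[[int, int], str]


def wait_k_policy(k: int) -> Policy:
    """Fixed wait-k rule (a stand-in, NOT the learned SimulMEGA policy):
    write only while the reader is at least k chunks ahead of the writer."""
    def decide(num_read: int, num_written: int) -> str:
        return "WRITE" if num_read - num_written >= k else "READ"
    return decide


def simulate(source_len: int, target_len: int, policy: Policy) -> List[int]:
    """Run the read/write loop and return, for each target token,
    how many source chunks had been read when that token was written."""
    num_read = 0
    delays: List[int] = []
    while len(delays) < target_len:
        # Once the source is exhausted, the only remaining action is WRITE.
        if num_read < source_len and policy(num_read, len(delays)) == "READ":
            num_read += 1
        else:
            delays.append(num_read)
    return delays


if __name__ == "__main__":
    print(simulate(source_len=5, target_len=4, policy=wait_k_policy(2)))
```

The returned per-token delays are the raw quantity from which latency measures are computed. A learned policy would replace `decide` with a function of the model's internal state rather than of the two counters alone.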
