Self-supervised ASR-TTS models suffer under out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) the ASR$\rightarrow$TTS direction is equipped with a language model reward that penalizes ASR hypotheses before they are forwarded to TTS; 2) in the TTS$\rightarrow$ASR direction, a hyper-parameter is introduced to scale the attention context computed from synthesized speech before it is passed to ASR, in order to handle out-of-domain data. Training strategies and the effectiveness of the EAT model are explored under out-of-domain data conditions. The results show that EAT significantly reduces the performance gap between supervised and self-supervised training, by an absolute 2.6\% and 2.7\% on Librispeech and BABEL, respectively.
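As a minimal sketch of the two features (the notation here is assumed for illustration, not taken from the paper): the ASR$\rightarrow$TTS objective on unpaired speech $x$ could combine the TTS reconstruction loss over sampled hypotheses $\hat{y}$ with a language model reward weighted by a scalar $\lambda$, while the TTS$\rightarrow$ASR direction rescales the attention context vector $c_t$ from synthesized speech by a hyper-parameter $\gamma$ before ASR decoding:
\[
\mathcal{L}_{\mathrm{ASR}\rightarrow\mathrm{TTS}}
= \mathbb{E}_{\hat{y}\sim p_{\mathrm{ASR}}(\cdot\mid x)}
\big[\,\mathcal{L}_{\mathrm{TTS}}(x \mid \hat{y})
- \lambda \log p_{\mathrm{LM}}(\hat{y})\,\big],
\qquad
\tilde{c}_t = \gamma\, c_t .
\]
Under this reading, $\lambda$ controls how strongly low-probability (likely erroneous) hypotheses are penalized before reaching TTS, and $\gamma$ attenuates the mismatch that synthesized speech introduces into the ASR attention mechanism on out-of-domain data.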