This paper proposes VARA-TTS, a non-autoregressive (non-AR) text-to-speech (TTS) model using a very deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE serve as queries for the residual attention module. By taking the coarse global alignment from the previous attention layer as an extra input, each subsequent attention layer produces a refined version of the alignment. This amortizes the burden of learning the textual-to-acoustic alignment across multiple attention layers and yields better robustness than using a single attention layer. At inference, an utterance-level speaking-speed factor, computed by a jointly trained speaking-speed predictor that takes the mean-pooled latent variables of the coarsest layer as input, determines the number of acoustic frames. Experimental results show that VARA-TTS achieves speech quality slightly inferior to that of an AR counterpart, Tacotron 2, but with an order-of-magnitude speed-up at inference, and outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.
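To make the two mechanisms concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: the class names, layer sizes, and the frame-count formula are illustrative assumptions, the VDVAE hierarchy itself is omitted, and the previous layer's alignment is assumed to have been upsampled to the current layer's temporal resolution.

```python
# Illustrative sketch of (1) residual attention that refines the
# text-to-acoustic alignment layer by layer, and (2) an utterance-level
# speaking-speed predictor. Names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAttentionLayer(nn.Module):
    """Scaled dot-product attention whose logits are biased by the
    coarse alignment from the previous layer (hypothetical sketch)."""

    def __init__(self, query_dim: int, key_dim: int, attn_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, attn_dim)
        self.k_proj = nn.Linear(key_dim, attn_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, query, text_keys, prev_align=None):
        # query:      (B, T_frames, query_dim)  VDVAE latents as queries
        # text_keys:  (B, T_text, key_dim)      text encoder outputs
        # prev_align: (B, T_frames, T_text)     previous layer's alignment
        #             (assumed already upsampled to this resolution), or None
        logits = torch.einsum(
            "btd,bsd->bts", self.q_proj(query), self.k_proj(text_keys)
        ) * self.scale
        if prev_align is not None:
            # Use the previous alignment as an extra input (here: a log-space
            # bias), so this layer only needs to learn a refinement of it.
            logits = logits + torch.log(prev_align + 1e-8)
        align = F.softmax(logits, dim=-1)       # refined alignment
        context = torch.bmm(align, text_keys)   # attended text context
        return context, align


class SpeakingSpeedPredictor(nn.Module):
    """Predicts an utterance-level speaking-speed factor from the
    mean-pooled coarsest-layer latents (hypothetical sketch)."""

    def __init__(self, latent_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),  # keep the speed factor positive
        )

    def forward(self, coarse_latents, n_text_tokens):
        # coarse_latents: (B, T_coarse, latent_dim)
        speed = self.net(coarse_latents.mean(dim=1)).squeeze(-1)  # (B,)
        # One plausible mapping (an assumption, not the paper's formula):
        # treat the factor as frames per text token.
        n_frames = torch.round(speed * n_text_tokens).long()
        return speed, n_frames
```

Stacking several such layers from the coarsest to the finest latent resolution gives each layer a progressively easier alignment problem, which is the intuition behind the robustness gain over a single attention layer.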