This paper presents Nana-HDR, a new non-attentive, non-autoregressive model for TTS that combines a Transformer-based Dense-fuse encoder with an RNN-based decoder. It consists of three main parts. First, a novel Dense-fuse encoder uses dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion. Second, the decoder is a single-layer, non-autoregressive RNN. Third, a duration predictor, rather than an attention model, connects the hybrid encoder and decoder. Experiments indicate that Nana-HDR exploits the advantages of each component: the strong text-encoding ability of the Transformer-based encoder, stateful decoding that is free of exposure bias and local information preference, and the stable alignment provided by the duration predictor. Owing to these advantages, Nana-HDR achieves competitive naturalness and robustness on two Mandarin corpora.
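To illustrate how a duration predictor can stand in for attention between an encoder and a decoder, the following minimal sketch expands per-token encoder states to frame level by simple repetition. This is an assumption for illustration only: the abstract does not say how Nana-HDR upsamples encoder outputs, and the function name `expand_by_duration`, the toy states, and the durations are all hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(
    encoder_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoder state by its predicted duration (in frames).

    Hypothetical helper: a repetition-based length regulator, which is one
    common way to align text and speech without attention. It is not
    necessarily the mechanism used in Nana-HDR.
    """
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")

    frames: List[List[float]] = []
    for state, duration in zip(encoder_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per output frame assigned to this token.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example: three tokens with 2-dimensional encoder states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # assumed durations; the second token gets no frames
frames = expand_by_duration(states, durations)
```

Because the output length equals the sum of the durations, the alignment is monotonic by construction, and a decoder can consume all frames at once rather than one step at a time.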