Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Cross-speaker style transfer aims to extract the speech style of a given reference utterance so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods rely on utterance-level style labels and perform style transfer through either global-scale or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Properly transferring a reading style across speakers therefore remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model that captures both the global genre and the local prosody of audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model transfers a given reading style to new target speakers. With the support of the local prosody and global genre type predictor, the potential of the proposed method for multi-speaker audiobook generation is further demonstrated.
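As a rough illustration of what "chunk-wise multi-scale" could mean in practice, the sketch below splits a sequence of frame-level features into fixed-size chunks, derives one local style vector per chunk, and pools those into a single global vector. This is only a toy under stated assumptions: the function names (`mean_pool`, `chunkwise_styles`), the fixed `chunk_size`, and the use of plain mean pooling in place of learned encoders are all hypothetical and are not taken from the paper, which does not specify its architecture in the abstract.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*vectors)]


def chunkwise_styles(frames, chunk_size):
    """Return (local_styles, global_style) for a sequence of frame features.

    Assumption: mean pooling stands in for the learned local-prosody and
    global-genre encoders that a real model would use.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not frames:
        raise ValueError("frames must be non-empty")

    # Split the frame sequence into consecutive chunks; the last may be shorter.
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

    # Local scale: one style vector per chunk.
    local_styles = [mean_pool(chunk) for chunk in chunks]

    # Global scale: a single vector summarizing all chunk-level vectors.
    global_style = mean_pool(local_styles)
    return local_styles, global_style


# Toy input: five frames with two hypothetical feature dimensions each.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
local_styles, global_style = chunkwise_styles(frames, chunk_size=2)
```

Here `local_styles` holds three vectors (two full chunks and one trailing single-frame chunk) and `global_style` is their unweighted average, so a short trailing chunk counts as much as a full one. A real system would replace both pooling steps with trained networks and would add the speaker/style disentanglement the abstract describes.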
