Text-based speech editing lets users edit speech intuitively by cutting, copying, and pasting text, which speeds up the editing process. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper addresses a new task: adding emotional effect to the edited speech during text-based speech editing to make the generated speech more expressive. To this end, we propose Emo-CampNet (emotion CampNet), which offers a choice of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes on top of the context-aware mask prediction network. Secondly, to prevent the emotional components of the original speech from interfering with the emotion of the generated speech, a neutral content generator is proposed to remove the emotion from the original speech; it is optimized within a generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech during text-based speech editing and can edit unseen speakers' speech; 2) detailed ablation experiments further demonstrate the effectiveness of emotional selectivity and the data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/