Voice conversion (VC) is the task of modifying an input speech signal so that it is perceived as spoken by a target speaker while the linguistic content of the input is preserved. It is a challenging but interesting task that has attracted significant interest over the last few years, with most systems relying on data-driven machine learning models. Performing the conversion in a low-latency, real-world scenario is even more challenging, and is further constrained by the limited availability of high-quality data. Data augmentations such as pitch shifting and noise addition are often used to increase the amount of training data for machine-learning-based models for this task. In this paper, we explore the efficacy of common data augmentation techniques for real-time voice conversion and introduce novel data augmentation techniques based on audio and voice transformation effects. We evaluate the conversions for both male and female target speakers using objective and subjective evaluation methodologies.
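To make the two common augmentations named above concrete, the following is a minimal sketch, not the pipeline used in this work. It assumes the audio is a plain list of float samples. The function names `add_noise` and `pitch_shift_by_resampling`, the white Gaussian noise model, and the linear-interpolation resampler are all illustrative assumptions. The resampling approach in particular is a deliberately naive stand-in: it changes the duration of the clip along with its pitch, whereas practical pitch shifters typically preserve duration.

```python
import math
import random


def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise at approximately the requested SNR in dB.

    Assumption: noise is white and Gaussian; real recordings may call for
    recorded background noise instead.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def pitch_shift_by_resampling(signal, semitones):
    """Naive pitch shift: read the signal at a different rate.

    Assumption: linear interpolation, no anti-aliasing filter. Shifting up
    shortens the clip and shifting down lengthens it, so this is NOT a
    duration-preserving pitch shifter.
    """
    rate = 2 ** (semitones / 12)  # frequency ratio for the given semitones
    n_out = int(len(signal) / rate)
    last = len(signal) - 1
    out = []
    for i in range(n_out):
        pos = i * rate                 # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out
```

Each function returns a new list and leaves its input untouched, so one clean recording can yield several augmented variants, for example the same utterance at different noise levels and pitch offsets.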