This paper proposes a new voice conversion (VC) task, converting human speech into dog-like speech while preserving linguistic information, as an example of human to non-human creature voice conversion (H2NH-VC). Whereas most VC studies deal with human-to-human conversion, H2NH-VC aims to convert human speech into speech that sounds like a non-human creature. Because we cannot collect a parallel dataset in which non-human creatures speak human language, H2NH-VC must be built on non-parallel VC. In this study, we use dogs as an example of a non-human creature target domain and define the "speak like a dog" task. To clarify the possibilities and characteristics of this task, we conducted a comparative experiment with existing representative non-parallel VC methods, varying the acoustic features (Mel-cepstral coefficients and Mel-spectrograms), the network architectures (five different kernel-size settings), and the training criteria (variational autoencoder (VAE)-based and generative adversarial network (GAN)-based). The converted voices were evaluated using mean opinion scores (for dog-likeness, sound quality, and intelligibility) and the character error rate (CER). The experiment showed that using Mel-spectrograms improved the dog-likeness of the converted speech, whereas preserving linguistic information remained challenging. Challenges and limitations of current VC methods for H2NH-VC are highlighted.
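For readers unfamiliar with the CER metric mentioned above, the following is a minimal sketch of its standard definition: the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The abstract does not say how the transcripts were obtained or normalized, so the function names `edit_distance` and `character_error_rate` and the plain-string inputs are assumptions made for illustration, not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(character_error_rate("hello", "hallo"))  # 0.2
```

A lower CER indicates that more of the linguistic content survived conversion. Note that the value can exceed 1.0 when the recognized transcript contains many inserted characters.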