Typically, singing voice conversion (SVC) relies on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing carries more expressive speaker characteristics than conversational speech, and a single embedding vector is suspected to capture only averaged, coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC that captures fine-grained speaker characteristics at multiple granularities. It consists of one up-sampling stream and three down-sampling streams. The up-sampling stream transforms linguistic features into audio samples, while one of the three down-sampling streams operates in the reverse direction. The temporal statistics of each down-sampling block are expected to represent speaker characteristics at a different granularity, and they are fed into the up-sampling blocks to enhance speaker modeling. Experimental results verify that the proposed method outperforms both LUT- and SRN-based SVC systems. Moreover, the proposed system supports one-shot SVC with only a few seconds of reference audio.
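The core idea of conditioning up-sampling blocks on temporal statistics from down-sampling blocks can be sketched as follows. This is a minimal, illustrative pure-Python sketch, not the paper's actual model: the window factors, the choice of mean/standard-deviation statistics, and the function names are all assumptions made for clarity.

```python
# Illustrative sketch (assumption): multi-granularity temporal statistics
# as hierarchical speaker representations. Frame-level features are grouped
# into windows at several down-sampling factors; each window contributes a
# mean/std statistics vector, and coarser factors summarize longer spans
# (coarser speaker characteristics).
from statistics import mean, pstdev

def temporal_stats(frames, factor):
    """frames: list of feature vectors (lists of floats).
    Groups consecutive frames into non-overlapping windows of size
    `factor` and returns one [means + stds] vector per window."""
    out = []
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        dims = list(zip(*window))  # transpose: per-dimension value tuples
        out.append([mean(d) for d in dims] + [pstdev(d) for d in dims])
    return out

def hierarchical_speaker_stats(frames, factors=(2, 4, 8)):
    """One statistics sequence per granularity (hypothetical factors)."""
    return {f: temporal_stats(frames, f) for f in factors}

def broadcast_to_frames(stats, factor, n_frames):
    """Repeat each statistics vector `factor` times so it can condition
    the frame-rate up-sampling blocks (e.g. by concatenation)."""
    up = [v for v in stats for _ in range(factor)]
    return up[:n_frames]
```

In a real model the statistics would be computed from learned intermediate activations of the down-sampling streams rather than raw frames, and the conditioning would enter the up-sampling blocks through learned projections; the sketch only shows the granularity structure.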