The field of Singing Voice Synthesis (SVS) has seen significant advances in recent years, driven by the rapid progress of diffusion-based approaches. However, capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics remains challenging, particularly in low-resource scenarios. To address this, we propose LAPS-Diff, a diffusion model that integrates language-aware embeddings with a vocal-style-guided learning mechanism, designed specifically for the Bollywood Hindi singing style. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word- and phone-level embeddings for an enriched lyrics representation. Additionally, we incorporate a style encoder and a pitch extraction model to compute style and pitch losses, capturing features essential to the naturalness and expressiveness of the synthesized singing, particularly vocal style and pitch variations. Furthermore, we utilize the MERT and IndicWav2Vec models to extract musical and contextual embeddings that serve as conditional priors, further refining the acoustic feature generation process. Objective and subjective evaluations demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model on our constrained dataset, which is typical of low-resource scenarios.
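The abstract does not specify how the auxiliary losses or the conditional priors are wired together; the sketch below illustrates one plausible arrangement, assuming a toy style encoder, hypothetical loss weights (`lambda_style`, `lambda_pitch`), and the public `m-a-p/MERT-v1-95M` checkpoint. The paper's exact modules, checkpoints, and weighting are not given here, and an IndicWav2Vec checkpoint would be used analogously for contextual embeddings.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# --- Auxiliary style/pitch losses (hypothetical wiring) -----------------------

class ToyStyleEncoder(torch.nn.Module):
    """Stand-in for the paper's style encoder: mel-spectrogram -> style vector."""
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(n_mels, style_dim, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool1d(1),  # pool over time -> utterance-level style
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # mel: (B, n_mels, T)
        return self.net(mel).squeeze(-1)                   # (B, style_dim)

def style_loss(style_enc, mel_pred, mel_ref):
    # Match the style embeddings of generated and reference singing.
    return F.l1_loss(style_enc(mel_pred), style_enc(mel_ref))

def pitch_loss(f0_pred, f0_ref, voiced_mask):
    # Compare F0 contours (from a pre-trained pitch extractor) on voiced frames.
    return F.l1_loss(f0_pred * voiced_mask, f0_ref * voiced_mask)

def total_loss(l_diff, l_style, l_pitch, lambda_style=0.1, lambda_pitch=0.1):
    # Hypothetical weights; the paper's actual weighting is not stated in the abstract.
    return l_diff + lambda_style * l_style + lambda_pitch * l_pitch

# --- Musical embeddings as conditional priors (MERT; IndicWav2Vec analogous) --

mert = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True
)

wav = torch.randn(24000 * 5)  # 5 s of dummy audio at MERT's 24 kHz input rate
inputs = processor(wav.numpy(), sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    # Frame-level musical embeddings, shape (1, T', 768); these would condition
    # the diffusion decoder alongside IndicWav2Vec contextual embeddings.
    musical_emb = mert(**inputs).last_hidden_state
```

Treating the style and pitch terms as additive penalties on top of the diffusion objective is one common design for auxiliary supervision; whether LAPS-Diff combines them this way, or conditions on the priors differently, would need to be confirmed from the full paper.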