In this study, we present an approach to training a single speech enhancement network that can perform both personalized and non-personalized speech enhancement. This is achieved by incorporating a frame-wise conditioning input that specifies the type of enhancement output. To improve the quality of the enhanced output and mitigate over-suppression, we experiment with re-weighting frames by the presence or absence of speech activity and with applying augmentations to speaker embeddings. By training in a multi-task learning setting, we empirically show that the proposed unified model obtains promising results on both personalized and non-personalized speech enhancement benchmarks, reaching performance similar to that of models trained specifically for either task. The strong performance of the proposed method demonstrates that the unified model is a more economical alternative to maintaining separate task-specific models at inference time.
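To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of a mask-based enhancement network that accepts a frame-wise task indicator and a speaker embedding, together with a speech-activity-weighted loss. The class `UnifiedSE`, its layer sizes, the `frame_weighted_loss` helper, and the choice of using a zero vector as the speaker embedding in non-personalized mode are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedSE(nn.Module):
    """Hypothetical unified model: one network serves both personalized and
    non-personalized enhancement, selected by a frame-wise conditioning input."""
    def __init__(self, n_feats=257, emb_dim=128, hidden=512):
        super().__init__()
        # Learned embeddings for the two task modes:
        # 0 = non-personalized, 1 = personalized.
        self.mode_emb = nn.Embedding(2, emb_dim)
        self.rnn = nn.GRU(n_feats + 2 * emb_dim, hidden,
                          num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_feats), nn.Sigmoid())

    def forward(self, noisy, mode_id, spk_emb):
        # noisy:   (B, T, F) magnitude features
        # mode_id: (B, T) frame-wise task indicator (long)
        # spk_emb: (B, emb_dim) speaker embedding; e.g., zeros in
        #          non-personalized mode (an assumption of this sketch)
        B, T, _ = noisy.shape
        cond = self.mode_emb(mode_id)                # (B, T, emb_dim)
        spk = spk_emb.unsqueeze(1).expand(B, T, -1)  # broadcast per frame
        h, _ = self.rnn(torch.cat([noisy, cond, spk], dim=-1))
        return noisy * self.mask(h)                  # masked enhancement

def frame_weighted_loss(est, ref, vad, w_speech=1.0, w_silence=0.2):
    """Illustrative frame re-weighting: weight each frame's reconstruction
    error by the presence or absence of speech activity to mitigate
    over-suppression. `vad` is a (B, T) 0/1 speech-activity indicator."""
    w = vad.float() * (w_speech - w_silence) + w_silence  # (B, T) weights
    per_frame = ((est - ref) ** 2).mean(dim=-1)           # (B, T) frame MSE
    return (w * per_frame).sum() / w.sum().clamp(min=1e-8)
```

A usage sketch: `model = UnifiedSE(); enhanced = model(noisy, torch.ones(B, T, dtype=torch.long), spk_emb)` runs the network in personalized mode with a target-speaker embedding (e.g., a d-vector), while passing a zero mode tensor and a zero embedding would request non-personalized enhancement from the same weights.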