The past several years have witnessed significant progress in modeling the Cocktail Party Problem in terms of speech separation and speaker extraction. In recent years, multi-modal cues, including spatial information, facial expressions, and voiceprints, have been introduced into the speaker extraction task as complementary sources of information to achieve better performance. However, this makes the front-end model for speaker extraction large and hard to deploy on resource-constrained devices. In this paper, we address this problem with novel model architectures and model compression techniques, and propose a lightweight multi-modal framework for speaker extraction (dubbed LiMuSE). LiMuSE adopts group communication (GC) to split multi-modal high-dimensional features into groups of low-dimensional features with smaller widths that can be processed in parallel, and further applies an ultra-low-bit quantization strategy to reduce the model size. Experiments on the GRID dataset show that incorporating GC into the multi-modal framework achieves on-par or better performance with 24.86 times fewer parameters, and applying the quantization strategy to the GC-equipped model yields a further compression ratio of about 9 times while maintaining performance comparable to the baselines. Our code will be available at https://github.com/aispeech-lab/LiMuSE.
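To make the group communication (GC) idea concrete, the following is a minimal PyTorch sketch, not the exact LiMuSE module: a D-dimensional feature is split into K groups of width D/K, each group is passed through a small shared sub-network (so the per-group width, and hence the parameter count, shrinks), and a lightweight inter-group step lets the groups exchange information. The class name, hidden size, and the particular inter-group mixing used here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GroupCommBlock(nn.Module):
    """Illustrative group-communication block (a sketch, not the LiMuSE architecture).

    Splits a D-dim feature into K groups of width D // K, applies a shared narrow
    transform to each group in parallel, then mixes information across groups.
    """

    def __init__(self, dim: int, num_groups: int, hidden: int = 64):
        super().__init__()
        assert dim % num_groups == 0, "feature dim must be divisible by the group count"
        self.num_groups = num_groups
        self.group_dim = dim // num_groups
        # Shared per-group transform: operates on narrow (dim // num_groups) features.
        self.intra = nn.Sequential(
            nn.Linear(self.group_dim, hidden), nn.PReLU(), nn.Linear(hidden, self.group_dim)
        )
        # Simple inter-group communication: mixes information across the K groups.
        self.inter = nn.Sequential(nn.Linear(num_groups, num_groups), nn.PReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        b, t, d = x.shape
        g = x.view(b, t, self.num_groups, self.group_dim)           # split into K groups
        g = g + self.intra(g)                                        # shared narrow transform, parallel over groups
        g = g + self.inter(g.transpose(-1, -2)).transpose(-1, -2)   # exchange information across groups
        return g.reshape(b, t, d)


if __name__ == "__main__":
    block = GroupCommBlock(dim=256, num_groups=8)
    y = block(torch.randn(2, 100, 256))
    print(y.shape)  # torch.Size([2, 100, 256])
```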
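The abstract also refers to an ultra-low-bit quantization strategy for reducing the model size. Below is a generic sketch of uniform symmetric fake-quantization of a weight tensor to a small number of bits; the function name, per-tensor granularity, and 3-bit setting are assumptions for illustration and may differ from the scheme actually used in LiMuSE.

```python
import torch


def quantize_weights_ultra_low_bit(w: torch.Tensor, num_bits: int = 3) -> torch.Tensor:
    """Uniform symmetric fake-quantization of a weight tensor to `num_bits` bits.

    A generic illustration only; the actual LiMuSE quantization strategy may differ
    in scheme and granularity.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 3 for signed 3-bit weights
    scale = (w.abs().max() / qmax).clamp_min(1e-8)      # per-tensor scale (an assumption)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                     # dequantized weights for simulation


if __name__ == "__main__":
    w = torch.randn(64, 64)
    wq = quantize_weights_ultra_low_bit(w, num_bits=3)
    # Storage drops roughly from 32 bits to num_bits per weight (plus scale overhead),
    # which is the kind of reduction behind a high model-compression ratio.
    print((w - wq).abs().mean())
```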