Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, the method has seen limited adoption for meeting transcription benchmarks, primarily due to its high computation time. In this paper, we describe our improved implementation of GSS that leverages the power of modern GPU-based pipelines, including batched processing of frequencies and segments, to provide a 300x speed-up over CPU-based inference. The improved inference time allows us to perform detailed ablation studies over several parameters of the GSS algorithm, such as context duration, number of channels, and noise class. We provide end-to-end reproducible pipelines for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting. Our code and recipes are publicly available: https://github.com/desh2608/gss.
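To illustrate the batching idea mentioned above, the following is a minimal sketch (not the authors' implementation) of how per-frequency operations can be expressed as a single batched GPU call with CuPy instead of a CPU loop over frequency bins. All array shapes, the mask, and the steering vectors here are hypothetical stand-ins.

```python
# Hypothetical sketch of batched per-frequency processing on a GPU with CuPy.
# A CPU pipeline typically loops over the F frequency bins one at a time; on a
# GPU, the same linear algebra can run over all bins in one batched call.
import cupy as cp

F, C, T = 257, 7, 1000  # frequency bins, channels, STFT frames (illustrative sizes)
X = cp.random.randn(F, C, T) + 1j * cp.random.randn(F, C, T)  # multichannel STFT
mask = cp.random.rand(F, T)  # stand-in for a per-time-frequency source activity mask

# Batched spatial covariance for every frequency bin at once:
#   R[f] = sum_t mask[f, t] * X[f, :, t] X[f, :, t]^H
R = cp.einsum('ft,fct,fdt->fcd', mask, X, X.conj())

# Batched linear solve across all F bins (e.g. toward an MVDR-style filter);
# cp.linalg.solve broadcasts over the leading frequency axis.
steer = cp.random.randn(F, C) + 1j * cp.random.randn(F, C)  # hypothetical steering vectors
w = cp.linalg.solve(R + 1e-6 * cp.eye(C), steer[..., None])[..., 0]  # shape (F, C)
```

Collapsing the frequency loop into one kernel launch is what makes GPU execution pay off; the same principle extends to batching over utterance segments.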