Since diarization and source separation of meeting data are closely related tasks, we here propose an approach that addresses the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at time-frequency resolution. These estimates serve as masks for source extraction, either directly via masking or via beamforming. The technique can be applied to both single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-attributed and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
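The core mechanism described above, using per-speaker time-frequency activity estimates as masks for source extraction, can be illustrated with a minimal sketch. All shapes, variable names, and the sigmoid mask parameterization below are illustrative assumptions, not the paper's actual network; the sketch shows only the masking path (not beamforming) and how frame-level diarization falls out of the same TF-resolution estimates.

```python
import numpy as np

# Hypothetical shapes; the paper's actual network architecture is not
# specified in this abstract. Assume a mixture STFT X of shape
# (frames, freq_bins) and per-speaker time-frequency activity logits of
# shape (num_speakers, frames, freq_bins) produced by some network.
rng = np.random.default_rng(0)
frames, freq_bins, num_spk = 100, 257, 2

X = rng.standard_normal((frames, freq_bins)) \
    + 1j * rng.standard_normal((frames, freq_bins))
logits = rng.standard_normal((num_spk, frames, freq_bins))

# Squash logits to soft masks in [0, 1] (sigmoid is an assumption here).
M = 1.0 / (1.0 + np.exp(-logits))

# Source extraction via masking: element-wise multiplication of each
# speaker's TF mask with the mixture STFT.
extracted = M * X[None, :, :]  # shape: (num_spk, frames, freq_bins)

# Diarization as a by-product: per-frame speaker activity obtained by
# averaging each TF mask over frequency and thresholding.
frame_activity = M.mean(axis=-1) > 0.5  # shape: (num_spk, frames)
```

For the beamforming variant mentioned in the abstract, the same masks would instead be used to estimate speech and noise spatial covariance matrices from multi-channel input, e.g. for an MVDR beamformer, rather than being applied multiplicatively.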