Overlapping speech diarization is conventionally treated as a multi-label classification problem. In this paper, we reformulate the task as a single-label prediction problem by encoding the multi-speaker labels with the power set. Specifically, we propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power-set-encoded labels according to the similarities between speech features and given speaker embeddings. We further extend the method and integrate it with downstream tasks by utilizing textual information, which has not been well studied in the previous literature. Experimental results show that our method achieves a lower diarization error rate than target-speaker voice activity detection. When textual information is involved, the diarization errors are further reduced. In a real meeting scenario, our method achieves a 34.11% relative improvement over the Bayesian hidden Markov model based clustering algorithm.
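To make the power-set reformulation concrete, the following is a minimal sketch of how a per-frame multi-label speaker-activity vector could be mapped to a single class index and back. The abstract does not specify the exact mapping, so the bit-weight scheme below (speaker i contributes 2^i) is an assumption, and the names `pse_encode` and `pse_decode` are hypothetical helpers introduced only for illustration.

```python
from typing import List


def pse_encode(active: List[int]) -> int:
    """Map a multi-label vector (1 = speaker active) to one class index.

    Assumed scheme: speaker i contributes 2**i, so each subset of
    speakers (each element of the power set) gets a unique integer.
    """
    return sum(bit << i for i, bit in enumerate(active))


def pse_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the multi-label vector from a class index."""
    return [(label >> i) & 1 for i in range(num_speakers)]


# Four frames with three candidate speakers:
# silence, speaker 0 alone, speakers 0+1 overlapping, speakers 1+2 overlapping.
frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]]
labels = [pse_encode(f) for f in frames]
print(labels)  # [0, 1, 3, 6]

# The mapping is invertible, so no activity information is lost.
assert all(pse_decode(l, 3) == f for l, f in zip(labels, frames))
```

Under this assumed scheme, N speakers yield 2^N classes, so a single softmax over the classes can replace N independent sigmoid outputs, with overlapping speech represented as its own class rather than as several simultaneous positive labels.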