Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and imposes the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments into continuous output streams, this constraint has to be fulfilled for every segment. In this contribution, we show that this constraint can be significantly relaxed. We propose a novel graph-based PIT criterion, which casts the assignment of utterances to output channels as a graph coloring problem. It only requires that the number of concurrently active speakers must not exceed the number of output channels. As a consequence, the system can process an arbitrary number of speakers and arbitrarily long segments, and thus can handle more diverse scenarios. Further, the stitching algorithm for obtaining a consistent output order in neighboring segments becomes less important and can even be eliminated completely, not least reducing the computational effort. Experiments on meeting-style WSJ data show improvements in recognition performance over the uPIT criterion.
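The utterance-to-channel assignment described above can be illustrated with a minimal sketch (this is an illustration under simplifying assumptions, not the paper's implementation): utterances are modeled as time intervals, two utterances conflict iff they overlap, and channels are assigned by greedy coloring of this overlap graph. Since the overlap graph of intervals is an interval graph, greedy coloring in order of start time uses exactly as many colors as the maximum number of concurrently active speakers, so an assignment exists whenever that number does not exceed the number of output channels.

```python
def assign_channels(utterances, num_channels):
    """Assign each utterance (start, end) to an output channel index by
    greedy coloring of the utterance overlap graph.

    Illustrative sketch only: utterance boundaries are assumed known, and
    overlap is checked by pairwise interval intersection.
    """
    # Process utterances in order of start time; for interval graphs this
    # greedy order is optimal (uses max-overlap many channels).
    order = sorted(range(len(utterances)), key=lambda i: utterances[i][0])
    channels = [None] * len(utterances)
    for i in order:
        s, e = utterances[i]
        # Channels occupied by already-assigned utterances that overlap i.
        busy = {channels[j] for j in order
                if channels[j] is not None
                and utterances[j][0] < e and s < utterances[j][1]}
        free = [c for c in range(num_channels) if c not in busy]
        if not free:
            # Feasibility condition of the graph-based criterion violated:
            # more concurrently active speakers than output channels.
            raise ValueError("more concurrent speakers than output channels")
        channels[i] = free[0]
    return channels

# Four utterances, never more than two active at once, fit on two channels
# even though more than two speakers may occur in total:
utts = [(0.0, 3.0), (2.0, 5.0), (4.5, 7.0), (6.5, 9.0)]
print(assign_channels(utts, num_channels=2))  # → [0, 1, 0, 1]
```

Under the uPIT constraint, by contrast, a segment containing more speakers than output channels cannot be handled at all, regardless of how little they overlap.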