Multi-talker conversational speech processing has attracted considerable interest for applications such as meeting transcription. Speech separation is often required to handle the overlapped speech commonly observed in conversation. Although the existing continuous speech separation approach based on utterance-level permutation invariant training (PIT) has proven effective in various conditions, it cannot leverage long-span relationships between utterances and is computationally inefficient because of its highly overlapped sliding windows. To overcome these drawbacks, we propose a novel training scheme named Group-PIT, which allows speech separation models to be trained directly on long-form speech with a low computational cost for label assignment. We explore two speech separation approaches with Group-PIT: direct long-span speech separation, and short-span speech separation with long-span tracking. Experiments on simulated meeting-style data demonstrate the effectiveness of the proposed approaches, especially for very long speech inputs.
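For readers unfamiliar with the baseline, the label assignment in utterance-level PIT can be sketched as below. This is a minimal illustration of standard PIT only, not of the proposed Group-PIT, whose details are not given in this abstract. The function names `mse` and `upit_loss`, the choice of mean squared error as the objective, and the toy signals are all assumptions made for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, references):
    """Utterance-level PIT: one permutation is chosen for the whole segment.

    estimates, references: lists of N signals (lists of floats).
    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    reference assigned to estimates[i].
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    # Exhaustive search over all N! output-to-reference assignments.
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example (hypothetical data): the model emitted the two sources
# in swapped order relative to the references.
refs = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
ests = [[0.0, 0.1, 0.9, 1.0], [1.0, 0.9, 0.0, 0.1]]
loss, perm = upit_loss(ests, refs)
print(perm, loss)
```

In the sliding-window setting described above, a search of this kind would be repeated for every window, which gives one intuition for why heavily overlapped windows make the baseline costly.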