In this paper, we present a novel training method for speaker change detection models. Speaker change detection is often viewed as a binary sequence labelling problem. This approach faces two main challenges: annotated change points are vague because of the silences between speaker turns, and the data are imbalanced because the majority of frames do not contain a speaker change. Conventional training methods tackle these by artificially increasing the proportion of positive labels in the training data. Instead, the proposed method uses an objective function that encourages the model to predict a single positive label within a specified collar. This is done by marginalizing over all possible subsequences that have exactly one positive label within the collar. Experiments on English and Estonian datasets show large improvements over the conventional training method. Additionally, the model outputs have peaks concentrated in a single frame, which removes the need for post-processing to find the exact predicted change point and is particularly useful for streaming applications.
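The marginalization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the model emits one independent logit per frame and that the collar is a fixed window of frames around an annotated change point; the function names are hypothetical. Under those assumptions, the probability of exactly one positive frame in the window is the sum over positions t of p_t times the product of (1 - p_s) over all other frames s, which can be computed stably in log space.

```python
import numpy as np


def log_prob_exactly_one(logits):
    """Log-probability that exactly one frame in the collar is positive.

    Assumes independent per-frame Bernoulli outputs with p_t = sigmoid(z_t).
    Uses the identity
        sum_t p_t * prod_{s != t} (1 - p_s)
            = prod_s (1 - p_s) * sum_t exp(z_t),
    which holds because p_t / (1 - p_t) = exp(z_t).
    """
    z = np.asarray(logits, dtype=np.float64)
    # log(1 - sigmoid(z)) = -softplus(z), computed stably via logaddexp.
    log_all_negative = -np.logaddexp(0.0, z).sum()
    # Stable log-sum-exp over the frames in the collar.
    m = z.max()
    log_sum_exp = m + np.log(np.exp(z - m).sum())
    return log_all_negative + log_sum_exp


def collar_loss(logits):
    """Negative log-likelihood of 'exactly one positive label in the collar'."""
    return -log_prob_exactly_one(logits)


def brute_force_prob(logits):
    """Reference computation: enumerate every position of the single positive."""
    z = np.asarray(logits, dtype=np.float64)
    p = 1.0 / (1.0 + np.exp(-z))
    total = 0.0
    for t in range(len(p)):
        terms = 1.0 - p      # every frame negative ...
        terms[t] = p[t]      # ... except frame t, which is positive
        total += terms.prod()
    return total


if __name__ == "__main__":
    flat = [0.0, 0.0, 0.0]          # probability spread evenly over the collar
    peaked = [-10.0, 10.0, -10.0]   # one confident frame
    print("flat loss:  ", collar_loss(flat))
    print("peaked loss:", collar_loss(peaked))
```

Under these assumptions, the loss is lower when the probability mass sits on a single frame than when it is spread across the collar, which is consistent with the sharply peaked outputs reported in the abstract. A complete training objective would also need a term for frames outside any collar, which this sketch omits.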