Unsupervised word segmentation of audio utterances is challenging because speech typically contains no gaps between words. In a preliminary experiment, we show that recent deep self-supervised features are very effective for word segmentation but require supervision to train the classification head. To extend their effectiveness to the unsupervised setting, we propose a pseudo-labeling strategy. Our approach relies on the observation that the temporal gradient magnitude of the embeddings (i.e., the distance between the embeddings of consecutive frames) is typically low far from word boundaries and higher near them. We apply a thresholding function to the temporal gradient magnitude to define a pseudo-label for wordness. We then train a linear classifier that maps the embedding of a single frame to its pseudo-label. Finally, we use the classifier score to predict whether a frame belongs to a word or to a boundary. In an empirical investigation, our method, despite its simplicity and fast run time, is shown to significantly outperform all previous methods on two datasets.
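The pipeline described above (temporal gradient magnitude, thresholding into pseudo-labels, a linear classifier on single frames) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the use of the mean gradient magnitude as the threshold, logistic regression trained with plain SGD as the linear classifier, and the toy 2-D "embeddings" that stand in for real self-supervised features.

```python
import math

def temporal_gradient_magnitude(frames):
    """Euclidean distance between each frame embedding and the next one."""
    mags = [math.dist(a, b) for a, b in zip(frames, frames[1:])]
    return mags + mags[-1:]  # repeat the last value so there is one per frame

def pseudo_labels(mags):
    """1 = word-like (low gradient), 0 = boundary-like (high gradient).
    Assumption: the threshold is the mean magnitude; the paper only says
    that some thresholding function is used."""
    threshold = sum(mags) / len(mags)
    return [1 if m <= threshold else 0 for m in mags]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_classifier(frames, labels, epochs=200, lr=0.1):
    """Logistic regression on single-frame embeddings, trained with SGD.
    Assumption: a stand-in for the paper's linear classifier."""
    w, b = [0.0] * len(frames[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(frames, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def wordness_scores(frames, w, b):
    """Classifier score per frame; higher means more word-like."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in frames]

# Toy data (hypothetical): word frames sit at [1, 0], boundary frames at [0, 1].
word, boundary = [1.0, 0.0], [0.0, 1.0]
frames = ([word] * 4 + [boundary]) * 2 + [word] * 4  # boundaries at indices 4 and 9

mags = temporal_gradient_magnitude(frames)
labels = pseudo_labels(mags)  # noisy: the word frame just before a boundary is also labeled 0
w, b = train_linear_classifier(frames, labels)
scores = wordness_scores(frames, w, b)
predicted_boundaries = [t for t, s in enumerate(scores) if s < 0.5]
print(predicted_boundaries)
```

Note that the pseudo-labels are noisy by construction, since a frame adjacent to a boundary also has a high gradient. Because the classifier sees only a single frame at a time, it can still separate word-like from boundary-like embeddings in this toy setting.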