Recent work on unsupervised speech segmentation has used self-supervised models with phone and word segmentation modules that are trained jointly. This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). To do this, I propose a new unit discovery model and a new symbolic word segmentation model, and then chain the two to segment speech. Both models use dynamic programming to minimize segment costs from a self-supervised network, with an additional duration penalty that encourages longer units. Concretely, for acoustic unit discovery, duration-penalized dynamic programming (DPDP) is used with a contrastive predictive coding model as the scoring network. For word segmentation, DPDP is applied with an autoencoding recurrent neural network as the scoring network. Chaining the two models gives word segmentation results comparable to those of state-of-the-art joint self-supervised segmentation models on an English benchmark. On French, Mandarin, German and Wolof data, it outperforms previous systems on the ZeroSpeech benchmarks. Analysis shows that the chained DPDP system segments shorter filler words well, but longer words might require an external top-down signal.
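The core mechanism described above, dynamic programming that trades off per-segment cost against a duration penalty, can be sketched as follows. This is a minimal illustration only: the toy segment cost (squared distance of frames to the segment mean) stands in for the self-supervised scoring network used in the paper, and the function name `dpdp_segment`, the penalty weight `lam`, and the `max_len` cap are all assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def dpdp_segment(features, lam=1.0, max_len=10):
    """Duration-penalized dynamic programming (illustrative sketch).

    Segments a sequence of frame features by minimising total segment
    cost plus a per-segment penalty `lam`; paying `lam` once per
    segment encourages fewer, longer units.
    """
    T = len(features)
    INF = float("inf")
    alpha = [0.0] + [INF] * T        # alpha[t]: best cost for frames [0, t)
    back = [0] * (T + 1)             # backpointer to best segment start

    def seg_cost(s, t):
        # Toy cost: squared distance of frames to the segment mean
        # (stand-in for a learned scoring network).
        seg = features[s:t]
        return float(((seg - seg.mean(axis=0)) ** 2).sum())

    for t in range(1, T + 1):
        for s in range(max(0, t - max_len), t):
            cost = alpha[s] + seg_cost(s, t) + lam
            if cost < alpha[t]:
                alpha[t] = cost
                back[t] = s

    # Recover segment end boundaries by following backpointers.
    bounds, t = [], T
    while t > 0:
        bounds.append(t)
        t = back[t]
    return sorted(bounds)

# Two clearly separated clusters of frames: a boundary falls at frame 5.
feats = np.array([[0.0]] * 5 + [[10.0]] * 5)
print(dpdp_segment(feats, lam=1.0))  # → [5, 10]
```

Raising `lam` merges segments into longer units; lowering it allows finer segmentation. In the paper's chained setup, the same DP recursion is run twice with different scoring networks: once over frames for unit discovery, then over the discovered unit sequence for word segmentation.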