Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), in which the next frame is predicted from past context. However, CPC only models the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level. In this framework, a convolutional neural network learns frame-level representations from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE to learn segment representations. The differentiable boundary detector allows us to train the frame-level and segment-level encoders jointly. Typically, phoneme and word segmentation are treated as separate tasks. We unify them and experimentally show that our single model outperforms existing phoneme and word segmentation methods on the TIMIT and Buckeye datasets. We analyze the impact of the boundary-detection threshold and of when the segmental loss is introduced during training.
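As a rough illustration of the contrastive objective described above, the following is a minimal NumPy sketch of an InfoNCE-style loss, where a context vector scores the true next-frame representation against negative (distractor) frames. The function name, shapes, and scoring by dot product are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(context, positive, negatives):
    """Toy InfoNCE loss: classify the true future frame against distractors.

    context:   (d,)   context vector c_t (illustrative)
    positive:  (d,)   representation of the true next frame
    negatives: (k, d) representations of k distractor frames
    Returns the negative log-probability of the positive under a softmax
    over dot-product similarity scores.
    """
    pos_score = context @ positive            # scalar similarity to the positive
    neg_scores = negatives @ context          # (k,) similarities to distractors
    logits = np.concatenate([[pos_score], neg_scores])
    logits -= logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                      # cross-entropy with positive at index 0
```

Minimizing this loss pushes the context vector toward the true future frame and away from the distractors; CPC applies this idea at the frame level, and SCPC additionally applies an analogous NCE objective over segment representations.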