We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora, we train and test the model in both the supervised and unsupervised settings. In the latter case, training targets are provided by a noisy label set generated from the predictions of a separate model trained in an unsupervised fashion. Results indicate that our model surpasses the previous state-of-the-art performance in both settings and on both datasets. Finally, following observations made during published code review and attempts to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely used evaluation metrics. We resolve this ambiguity by delineating two distinct evaluation schemes and describing their nuances.
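To illustrate how such metric ambiguity can arise, the sketch below shows one plausible source (the function names, tolerance value, and example boundaries are our own illustration, not taken from the paper): boundary precision under a tolerance window differs depending on whether predicted boundaries are matched one-to-one with reference boundaries or merely counted as hits against any nearby reference.

```python
def hits_lenient(ref, pred, tol=0.02):
    # Count every predicted boundary within `tol` seconds of ANY reference
    # boundary; a single reference can absorb several predictions.
    return sum(1 for p in pred if any(abs(p - r) <= tol for r in ref))

def hits_strict(ref, pred, tol=0.02):
    # Greedy one-to-one matching: each reference boundary may be claimed
    # by at most one prediction.
    used = set()
    hits = 0
    for p in pred:
        for i, r in enumerate(ref):
            if i not in used and abs(p - r) <= tol:
                used.add(i)
                hits += 1
                break
    return hits

ref = [0.10, 0.30, 0.50]          # reference boundaries (seconds)
pred = [0.09, 0.11, 0.31]         # two predictions near the first reference

# Lenient counting credits both 0.09 and 0.11 against the reference at 0.10,
# giving precision 3/3; one-to-one matching credits only one, giving 2/3.
assert hits_lenient(ref, pred) == 3
assert hits_strict(ref, pred) == 2
```

Two implementations that agree on the verbal definition of "a predicted boundary within the tolerance window" can thus report different precision, recall, and F1 on the same output, which is the kind of discrepancy that motivates spelling out the evaluation schemes explicitly.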