We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations. Rather than producing individual predictions for each of the future representations, the model emits a shorter sequence of predictions that is then aligned to the upcoming representations. In this way, the prediction network solves the simpler task of predicting the next symbols, but not their exact timing, while the encoding network is trained to produce piece-wise constant latent codes. We evaluate the model on a speech coding task and demonstrate that the proposed Aligned Contrastive Predictive Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX error rates, while being slightly faster to train due to the reduced number of prediction heads.
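The abstract only sketches the mechanism, so the following minimal NumPy sketch illustrates one plausible reading of it: K predictions are monotonically aligned to M > K future latent vectors (so consecutive frames can share a prediction, encouraging piece-wise constant codes), and the aligned pairs are scored with an InfoNCE-style contrastive loss. The dynamic-programming alignment, dot-product scoring, negative sampling, and all names (`monotonic_alignment`, `acpc_style_loss`, `temp`) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def monotonic_alignment(preds, futures):
    """Assign each of the M future frames to one of the K predictions under a
    monotonic alignment that uses every prediction at least once and maximizes
    the total dot-product score (illustrative DP, not the paper's algorithm)."""
    K, M = preds.shape[0], futures.shape[0]
    assert K <= M
    score = preds @ futures.T                       # (K, M) pairwise scores
    dp = np.full((K + 1, M + 1), -np.inf)
    dp[0, 0] = 0.0
    choice = np.zeros((K + 1, M + 1), dtype=int)    # 1 = advance to next prediction
    for k in range(1, K + 1):
        for m in range(k, M + 1 - (K - k)):
            stay = dp[k, m - 1] + score[k - 1, m - 1]        # frame stays on prediction k
            advance = dp[k - 1, m - 1] + score[k - 1, m - 1] # frame starts prediction k
            if advance >= stay:
                dp[k, m], choice[k, m] = advance, 1
            else:
                dp[k, m], choice[k, m] = stay, 0
    # Backtrack to recover the prediction index assigned to each frame.
    align = np.zeros(M, dtype=int)
    k, m = K, M
    while m > 0:
        align[m - 1] = k - 1
        if choice[k, m] == 1:
            k -= 1
        m -= 1
    return align

def acpc_style_loss(preds, futures, negatives, temp=0.1):
    """InfoNCE-style loss over aligned (prediction, future frame) pairs; drawing
    negatives from elsewhere in the batch is an assumption made here."""
    align = monotonic_alignment(preds, futures)
    losses = []
    for m, k in enumerate(align):
        pos = preds[k] @ futures[m] / temp
        neg = preds[k] @ negatives.T / temp         # (N,) negative scores
        logits = np.concatenate(([pos], neg))
        losses.append(-(pos - np.log(np.exp(logits).sum())))
    return float(np.mean(losses))

# Tiny usage example with random vectors: 3 predictions, 5 future frames.
rng = np.random.default_rng(0)
preds = rng.standard_normal((3, 8))
futures = rng.standard_normal((5, 8))
negatives = rng.standard_normal((10, 8))
print(monotonic_alignment(preds, futures), acpc_style_loss(preds, futures, negatives))
```

Because several consecutive future frames can be matched to the same prediction, the encoder is only rewarded for making those frames similar to one another, which is the stated route to slowly varying, piece-wise constant latent codes.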