We propose an approach to cognitive coding of speech based on the unsupervised extraction of contextual representations at two hierarchical levels of abstraction. Speech attributes that last one hundred milliseconds or less, such as phoneme identity, are captured at the lower level of abstraction, while attributes that persist for up to one second, such as speaker identity and emotion, are captured at the higher level. This decomposition is achieved by a two-stage neural network, with a lower and an upper stage operating at different time scales. Both stages are trained to predict the content of the signal in their respective latent spaces. A top-down pathway between the stages further improves the predictive capability of the network. With an application to speech compression in mind, we investigate the effect of dimensionality reduction and low-bitrate quantization on the extracted representations. The performance measured on the LibriSpeech and EmoV-DB datasets reaches, and for some speech attributes even exceeds, that of state-of-the-art approaches.
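To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of a two-stage predictive model of this kind. All module names, layer choices, dimensions, time scales, and the CPC-style contrastive objective are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-stage hierarchical predictive model.
# Layer sizes, strides, and the InfoNCE-style loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage(nn.Module):
    """One stage: encodes its input and builds contextual representations."""
    def __init__(self, in_dim, latent_dim, stride, pred_steps):
        super().__init__()
        # The convolution stride sets the stage's time scale (e.g. ~10 ms
        # frames for the lower stage, ~1 s for the upper stage -- assumed).
        self.encoder = nn.Conv1d(in_dim, latent_dim,
                                 kernel_size=stride, stride=stride)
        self.context = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # One linear predictor per future step (CPC-style, an assumption).
        self.predictors = nn.ModuleList(
            [nn.Linear(latent_dim, latent_dim) for _ in range(pred_steps)])

    def forward(self, x):
        # x: (batch, in_dim, time) -> latents z: (batch, frames, latent_dim)
        z = self.encoder(x).transpose(1, 2)
        c, _ = self.context(z)  # contextual representations
        return z, c

def infonce_loss(z, c, predictors):
    """Contrastive predictive loss: each context c_t must identify the true
    future latent z_{t+k} among negatives drawn from the same batch."""
    loss = 0.0
    for k, pred in enumerate(predictors, start=1):
        if z.size(1) <= k:
            break
        pred_z = pred(c[:, :-k])   # predicted future latents
        target = z[:, k:]          # true future latents
        # Score every prediction against every candidate target.
        logits = torch.einsum('btd,nsd->btns', pred_z, target)
        b, t = logits.shape[:2]
        logits = logits.reshape(b * t, -1)
        # The positive for position (b, t) is the matching (b, t) target.
        idx_b = torch.arange(b, device=z.device).repeat_interleave(t)
        idx_t = torch.arange(t, device=z.device).repeat(b)
        loss = loss + F.cross_entropy(logits, idx_b * t + idx_t)
    return loss / len(predictors)

# Lower stage models short-lived attributes (e.g. phoneme identity);
# the upper stage runs on lower-stage contexts at a coarser time scale.
lower = Stage(in_dim=1, latent_dim=256, stride=160, pred_steps=4)   # ~10 ms
upper = Stage(in_dim=256, latent_dim=256, stride=100, pred_steps=2) # ~1 s
top_down = nn.Linear(256, 256)  # top-down pathway (assumed to be linear)

wav = torch.randn(8, 1, 16000 * 4)  # 4 s of 16 kHz audio
z_lo, c_lo = lower(wav)
z_hi, c_hi = upper(c_lo.transpose(1, 2))
# Top-down conditioning: broadcast slow upper context back to the fast rate.
c_lo = c_lo + top_down(c_hi).repeat_interleave(100, dim=1)[:, :c_lo.size(1)]
loss = infonce_loss(z_lo, c_lo, lower.predictors) \
     + infonce_loss(z_hi, c_hi, upper.predictors)
```

The design point this sketch illustrates is the separation of time scales: the lower stage operates at roughly a 10 ms frame rate while the upper stage summarizes about one second of context, both stages are trained purely predictively in their own latent spaces, and the top-down term lets the slow context inform the fast predictions.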