Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first- and second-order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper we examine an alternative pooling method, where pairwise correlations between channels at given frequencies are used as statistics. The method is inspired by style-transfer methods in computer vision, where the style of an image, modeled by the matrix of channel-wise correlations, is transferred to another image in order to produce a new image having the style of the first and the content of the second. By drawing analogies between image style and speaker characteristics, and between image content and the phonetic sequence, we explore the use of such channel-wise correlation features to train a ResNet architecture in an end-to-end fashion. Our experiments on VoxCeleb demonstrate the effectiveness of the proposed pooling method in speaker recognition.
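As a concrete illustration of the pooling idea described above (not code from the paper), the following minimal PyTorch sketch computes per-frequency channel-correlation matrices from a 2D CNN feature map, in the spirit of the Gram matrices used in style transfer. The module name `CorrelationPooling`, the `(batch, channels, freq, time)` shape convention, and the `eps` stabilizer are our assumptions.

```python
import torch
import torch.nn as nn


class CorrelationPooling(nn.Module):
    """Pools a (batch, channels, freq, time) feature map into per-frequency
    channel-correlation statistics, analogous to style-transfer Gram matrices.
    Hypothetical sketch; shapes and naming are assumptions, not the paper's code.
    """

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # numerical floor for the standard deviations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T) -- channels x frequency bins x time frames
        B, C, F, T = x.shape
        # Center each (channel, frequency) series over the time axis
        x = x - x.mean(dim=-1, keepdim=True)
        # Covariance across time for each frequency bin: (B, F, C, C)
        xt = x.permute(0, 2, 1, 3)                # (B, F, C, T)
        cov = xt @ xt.transpose(-1, -2) / (T - 1)
        # Normalize the covariances to correlations
        std = torch.sqrt(torch.diagonal(cov, dim1=-2, dim2=-1) + self.eps)
        corr = cov / (std.unsqueeze(-1) * std.unsqueeze(-2))
        # The matrix is symmetric, so keep only the strict upper triangle
        iu = torch.triu_indices(C, C, offset=1)
        feats = corr[..., iu[0], iu[1]]           # (B, F, C*(C-1)/2)
        return feats.flatten(start_dim=1)         # (B, F*C*(C-1)/2)
```

Under this reading, a linear layer such as `nn.Linear(F * C * (C - 1) // 2, embedding_dim)` would then project the flattened correlation statistics to the speaker embedding, mirroring how mean and standard-deviation pooling statistics are projected in conventional pipelines.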