Pooling is needed to aggregate frame-level features into utterance-level representations for speaker modeling. Given the success of statistics-based pooling methods, we hypothesize that speaker characteristics are well represented in the statistical distribution of the pre-aggregation layer's output, and propose transport-oriented feature aggregation for deriving speaker embeddings. The aggregated representation encodes the geometric structure of the underlying feature distribution, which is expected to contain valuable speaker-specific information that may not be captured by commonly used statistical measures such as the mean and variance. The original transport-oriented feature aggregation is also extended to a weighted-frame version to incorporate the attention mechanism. Experiments on speaker verification with the VoxCeleb dataset show improvements over statistics pooling and its attentive variant.
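For context, the statistics pooling baseline and its attentive (weighted-frame) variant mentioned above can be sketched as follows. This is a minimal NumPy illustration of the baselines only, not of the proposed transport-oriented aggregation. The function name `statistics_pooling` and its arguments are hypothetical, and in an attentive system the frame weights would come from a learned attention network rather than being supplied by hand.

```python
import numpy as np


def statistics_pooling(frames, weights=None):
    """Aggregate frame-level features (T, D) into one utterance-level vector (2D,).

    Hypothetical helper for illustration. With weights=None every frame counts
    equally (plain statistics pooling). With per-frame weights, the mean and
    variance are weighted, as in the attentive variant.
    """
    frames = np.asarray(frames, dtype=float)
    num_frames = frames.shape[0]
    if weights is None:
        weights = np.full(num_frames, 1.0 / num_frames)
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()  # normalize so the weights sum to 1
    mean = weights @ frames                      # (D,) weighted mean
    variance = weights @ (frames - mean) ** 2    # (D,) weighted variance
    return np.concatenate([mean, variance])


# Two frames with two feature dimensions each.
frames = [[1.0, 2.0], [3.0, 6.0]]
uniform = statistics_pooling(frames)
attentive = statistics_pooling(frames, weights=[3.0, 1.0])
```

Both outputs keep only first- and second-order statistics per dimension, which is the limitation the abstract points to: any further structure in the frame-level feature distribution is discarded.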