Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are long, high-dimensional, variable-length sequences with a complex hierarchical structure, and they may carry different information at each time-frequency (TF) location. For example, it may be more beneficial to focus on high-energy parts for phoneme classes such as fricatives. Because a standard convolutional layer operates only on neighboring local regions, it cannot capture this complex global TF context. In this study, a general global time-frequency context modeling framework is proposed to leverage context information specifically for speaker representation modeling. First, a data-driven attention-based context model is introduced to capture long-range, non-local relationships across different time-frequency locations. Second, a data-independent 2D-DCT based context model is proposed to improve model interpretability. A multi-DCT attention mechanism is presented to improve modeling power with alternative DCT basis forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed lightweight blocks can be easily incorporated into a speaker model at little additional computational cost, and they improve speaker verification performance by a large margin compared to the standard ResNet model and the Squeeze-and-Excitation block. Detailed ablation studies are also performed to analyze the factors that may affect the performance of the individual proposed modules. Experimental results show that the proposed global context modeling framework can efficiently improve the learned speaker representations by achieving both channel-wise and time-frequency feature recalibration.
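The three steps described above (attention-based context, 2D-DCT based context, and similarity-based recalibration) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the key projection `w_key`, the scaled dot-product similarity, and the sigmoid gate are all assumptions chosen for illustration, and the feature map is assumed to have shape (channels, frequency, time).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(feats, w_key):
    """Data-driven context: softmax-weighted sum over all TF locations.

    feats: (C, F, T) feature map; w_key: (C,) hypothetical key projection.
    """
    C, F, T = feats.shape
    flat = feats.reshape(C, F * T)      # (C, F*T)
    attn = softmax(w_key @ flat)        # (F*T,), one weight per TF location
    return flat @ attn                  # (C,) global context vector

def dct_basis(F, T, u, v):
    # Unnormalized 2D DCT-II basis function with frequency indices (u, v).
    f = np.arange(F)[:, None]
    t = np.arange(T)[None, :]
    return np.cos(np.pi * (f + 0.5) * u / F) * np.cos(np.pi * (t + 0.5) * v / T)

def dct_context(feats, u=0, v=0):
    """Data-independent context: project each channel onto a fixed DCT basis."""
    _, F, T = feats.shape
    return (feats * dct_basis(F, T, u, v)).sum(axis=(1, 2))   # (C,)

def recalibrate(feats, context):
    """Gate each TF location by its similarity to the global context."""
    C = feats.shape[0]
    # Assumed similarity: scaled dot product between context and local feature.
    sim = np.einsum("c,cft->ft", context, feats) / np.sqrt(C)  # (F, T)
    gate = 1.0 / (1.0 + np.exp(-sim))                          # values in (0, 1)
    return feats * gate                                        # (C, F, T)

# Usage on random data standing in for an intermediate ResNet feature map.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 5, 6))
w_key = rng.standard_normal(4)
ctx_attn = attention_context(feats, w_key)
ctx_dct = dct_context(feats, u=0, v=0)
out = recalibrate(feats, ctx_attn)
```

With `u = v = 0` the DCT basis is constant, so `dct_context` reduces to a plain sum over all TF locations (global average pooling up to a scale factor); other `(u, v)` choices weight the TF plane with different fixed cosine patterns.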