Choosing an optimal time-frequency resolution is a difficult but important step in speech signal classification tasks such as speech anti-spoofing. Performance can vary as much across time-frequency resolutions as it does across model architectures, which makes it hard to tell where an improvement actually comes from when a new network architecture is introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. The optimal weighted combination of multiple time-frequency resolutions is learned automatically from the objective of the classification task. Features extracted at different time-frequency resolutions are weighted and concatenated as inputs to the subsequent networks, where the weights are predicted by a learnable neural network inspired by the weighting block in squeeze-and-excitation networks (SENet). Furthermore, we investigate refining the chosen time-frequency resolutions by pruning those with relatively low importance, which reduces the complexity and size of the model. The proposed method is evaluated on the speech anti-spoofing task of ASVspoof 2019, and its superiority is demonstrated by comparison with similar baselines.
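As a rough illustration of the weighting idea described in the abstract, the following minimal sketch scores several feature maps with an SE-style block, scales each map by its score, concatenates them, and then ranks resolutions for pruning. It is not the authors' implementation. The function names, the global-average squeeze, the two-layer bottleneck, the hidden size, and the random (untrained) parameters are all assumptions made for illustration. The sketch also assumes every feature map shares the same number of frames, so maps can be stacked along the frequency axis.

```python
import math
import random


def squeeze(feature_maps):
    """Global average of each feature map: one scalar per resolution."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def excite(z, w1, w2):
    """Bottleneck MLP (ReLU, then sigmoid): one weight in (0, 1) per resolution."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def weight_and_concat(feature_maps, weights):
    """Scale each map by its weight and stack the maps along the frequency axis."""
    fused = []
    for fm, a in zip(feature_maps, weights):
        fused.extend([[a * v for v in row] for row in fm])
    return fused


def prune(weights, keep):
    """Indices of the `keep` highest-weighted resolutions, in original order."""
    ranked = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    return sorted(ranked[:keep])


# Toy input: three hypothetical resolutions with 3, 5 and 9 frequency bins,
# each laid out as rows = frequency bins, columns = 4 shared frames.
rng = random.Random(0)
n_frames = 4
bins = [3, 5, 9]
feature_maps = [[[rng.random() for _ in range(n_frames)] for _ in range(b)]
                for b in bins]

# Untrained stand-ins for parameters that would be learned end to end.
n_res, n_hidden = len(bins), 2
w1 = [[rng.uniform(-1, 1) for _ in range(n_res)] for _ in range(n_hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_res)]

weights = excite(squeeze(feature_maps), w1, w2)
fused = weight_and_concat(feature_maps, weights)  # 17 rows x 4 frames
kept = prune(weights, keep=2)                     # resolutions to retain
```

In an actual system the parameters `w1` and `w2` would be trained jointly with the classifier, and pruning would presumably rely on weights aggregated over a dataset rather than on a single input as shown here.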