Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly by raw speech samples. Rather than employing standard hand-crafted features, these CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only the low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
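The core idea is that each first-layer kernel is a band-pass filter fully determined by two learnable cutoff frequencies, built as the difference of two windowed sinc low-pass filters. The sketch below is a minimal NumPy illustration of that construction, not the authors' implementation; the function name, kernel size, window choice, and normalization are assumptions made for the example.

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel parametrized only by its two cutoff frequencies (Hz).

    Illustrative sketch of the SincNet idea: the kernel is the difference of two
    windowed sinc low-pass filters, so only f1 and f2 would be learned rather
    than every filter tap.
    """
    # Symmetric time axis in samples around n = 0
    n = np.arange(-(kernel_size // 2), kernel_size // 2 + 1)
    f1_norm = f1 / sample_rate          # normalized cutoffs in [0, 0.5]
    f2_norm = f2 / sample_rate
    # np.sinc(x) = sin(pi*x)/(pi*x), so 2*f*np.sinc(2*f*n) is an ideal
    # low-pass filter with normalized cutoff f
    low1 = 2 * f1_norm * np.sinc(2 * f1_norm * n)
    low2 = 2 * f2_norm * np.sinc(2 * f2_norm * n)
    band = low2 - low1                  # band-pass = difference of two low-passes
    band *= np.hamming(len(n))          # smooth the truncation of the ideal filter
    return band / np.abs(band).max()    # simple amplitude normalization (assumed)

# Example: a kernel passing roughly the 300-3400 Hz telephone band
kernel = sinc_bandpass(300.0, 3400.0)
```

In a trained SincNet layer, f1 and f2 would be network parameters updated by backpropagation, so each filter needs only two values instead of hundreds of taps.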