Unseen noise, meaning noise that is not represented in the training data, is difficult to anticipate and degrades model performance. Various methods have been investigated to mitigate its effect. In our previous work, we proposed an Instance-level Dynamic Filter (IDF) and a Pixel Dynamic Filter (PDF) to extract noise-robust features. However, the IDF uses simple feature pooling to reduce computational cost, which may degrade the performance of the dynamic filter. In this paper, we propose an efficient dynamic filter that improves on this design. Instead of taking a simple feature mean, we split the Time-Frequency (T-F) features into non-overlapping chunks and apply separable convolutions along each feature direction (inter-chunk and intra-chunk). Additionally, we propose Dynamic Attention Pooling, which maps high-dimensional features to low-dimensional feature embeddings. We apply these methods to the IDF for keyword spotting and speaker verification tasks. We confirm that the proposed method performs better than state-of-the-art models in unseen environments (unseen noise and unseen speakers).
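To make the two ideas above concrete, the following is a minimal sketch, not the implementation used in this work. It shows (a) splitting a T-F feature matrix into non-overlapping chunks along the time axis and (b) a generic softmax attention pooling that reduces a set of frames to a single embedding. The function names, the fixed `query` vector, the dot-product scoring, and the choice to drop trailing frames are all assumptions made for illustration. The separable convolutions are omitted, and plain attention pooling stands in for them in both the intra-chunk and inter-chunk directions.

```python
import math


def split_into_chunks(features, chunk_len):
    """Split a T x F matrix (list of frames) into non-overlapping time chunks.

    Assumption: trailing frames that do not fill a whole chunk are dropped.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    n_chunks = len(features) // chunk_len
    return [features[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weighted mean of frames; weights are softmax(dot(frame, query)).

    Hypothetical stand-in for an attention pooling layer: in a trained model
    the scoring function would be learned rather than a fixed query vector.
    """
    if not frames:
        raise ValueError("attention_pool needs at least one frame")
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]


def chunked_embedding(features, chunk_len, query):
    """Pool within each chunk (intra-chunk), then across chunks (inter-chunk)."""
    chunks = split_into_chunks(features, chunk_len)
    chunk_vectors = [attention_pool(chunk, query) for chunk in chunks]
    return attention_pool(chunk_vectors, query)


if __name__ == "__main__":
    # Toy input: 5 frames with 2 frequency bins each; the last frame is dropped.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [5.0, 5.0]]
    print(split_into_chunks(toy, 2))
    print(chunked_embedding(toy, 2, [0.0, 0.0]))
```

With an all-zero `query` every frame gets the same weight, so the pooling reduces to the simple feature mean that the abstract contrasts against; a non-zero query shifts weight toward frames that align with it.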