We propose a learnable, content-adaptive front end for audio signal processing. Before the advent of modern deep learning, audio pipelines relied on fixed, non-learnable front-ends such as the spectrogram or mel-spectrogram, used with or without neural architectures. As convolutional architectures came to support applications such as ASR and acoustic scene understanding, the field shifted to learnable front-ends in which both the basis functions and their weights are learned from scratch and optimized for the task of interest. With the move to purely transformer-based architectures that contain no convolutional blocks, a linear layer instead projects small waveform patches into a low-dimensional latent space before feeding them to the transformer. In this work, we propose a way of computing a content-adaptive, learnable time-frequency representation. We pass each audio signal through a bank of convolutional filters, each producing a fixed-dimensional vector. This is akin to learning a bank of finite impulse response (FIR) filterbanks and routing each input signal through the optimal filterbank according to its content. Such a content-adaptive, learnable time-frequency representation may be more broadly applicable, beyond the experiments in this paper.
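To make the routing idea concrete, the following is a minimal PyTorch sketch of one way such a content-adaptive front end could be realized. It is an illustrative assumption, not the paper's implementation: the module name, the soft-gating mechanism, and all hyperparameters are hypothetical, and the hard "optimum filterbank" selection is relaxed into a differentiable soft mixture over the banks.

```python
# Hypothetical sketch: K learnable FIR filterbanks as 1-D convolutions, plus a
# small gating network that scores each bank from the raw waveform. The bank
# outputs are mixed with the resulting soft weights -- a differentiable
# relaxation of routing the signal through the best-matching filterbank.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAdaptiveFrontEnd(nn.Module):
    def __init__(self, num_banks=4, num_filters=64, kernel_size=401, hop=160):
        super().__init__()
        # K parallel learnable FIR filterbanks (1 waveform channel -> F filters).
        self.banks = nn.ModuleList(
            nn.Conv1d(1, num_filters, kernel_size, stride=hop,
                      padding=kernel_size // 2, bias=False)
            for _ in range(num_banks)
        )
        # Gating network: summarizes the waveform and scores each filterbank.
        self.gate = nn.Sequential(
            nn.Conv1d(1, 16, 101, stride=50), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, num_banks),
        )

    def forward(self, wav):                      # wav: (batch, 1, samples)
        weights = F.softmax(self.gate(wav), -1)  # (batch, K), content-dependent
        feats = torch.stack([bank(wav) for bank in self.banks], 1)  # (B,K,F,T)
        # Convex combination of the K filterbank outputs per example.
        return (weights[:, :, None, None] * feats).sum(1)           # (B,F,T)

# Example: a 1-second, 16 kHz waveform yields a (1, 64, 100) time-frequency map.
x = torch.randn(1, 1, 16000)
print(ContentAdaptiveFrontEnd()(x).shape)
```

The soft mixture keeps the whole front end trainable end-to-end with the downstream network; a hard per-input selection (e.g., argmax over the gate scores) could be recovered at inference time if desired.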