Most audio processing pipelines involve transformations that act on fixed-dimensional input representations of audio. For example, when using the Short-Time Fourier Transform (STFT), the DFT size fixes the dimension of the input representation. As a consequence, most audio machine learning models are designed to process fixed-size vector inputs, which often prevents learned models from being reused on audio with different sampling rates or alternative representations. We note, however, that the intrinsic spectral information in an audio signal is invariant to the choice of input representation and sampling rate. Motivated by this, we introduce a novel way of processing audio signals by treating them as a collection of points in feature space, and we use point cloud machine learning models that give us invariance to the choice of representation parameters, such as the DFT size or the sampling rate. Additionally, we observe that these methods result in smaller models and allow us to significantly subsample the input representation with minimal effect on a trained model's performance.
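As a rough illustration of the point-set idea, the sketch below turns one audio frame into a set of (frequency in Hz, magnitude) points and randomly subsamples it. It is a minimal sketch under stated assumptions, not the authors' pipeline: the feature choice (frequency and normalized magnitude), the Hann window, and the helper names `frame_to_points` and `subsample_points` are all hypothetical. The purpose is only to show that frames analyzed with different DFT sizes and sampling rates land in the same coordinate space, even though the number of points differs.

```python
import numpy as np


def frame_to_points(frame, sample_rate):
    """Map one audio frame to an (N, 2) array of (frequency_hz, magnitude) points.

    Assumption: the DFT size equals the frame length, so N = len(frame) // 2 + 1.
    """
    n_fft = len(frame)
    window = np.hanning(n_fft)
    spectrum = np.fft.rfft(frame * window)
    # Express each bin in Hz so the coordinate does not depend on the DFT size.
    freqs_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Divide by the window sum so magnitudes are comparable across frame lengths.
    magnitude = np.abs(spectrum) / window.sum()
    return np.stack([freqs_hz, magnitude], axis=1)


def subsample_points(points, k, rng):
    """Keep k points chosen uniformly at random without replacement."""
    idx = rng.choice(len(points), size=k, replace=False)
    return points[idx]


def sine_frame(freq_hz, sample_rate, n_samples):
    """Generate a unit-amplitude sine frame (test signal for the demo)."""
    t = np.arange(n_samples) / sample_rate
    return np.sin(2.0 * np.pi * freq_hz * t)


# The same 1 kHz tone under two different sampling rates and DFT sizes.
points_a = frame_to_points(sine_frame(1000.0, 16000, 512), 16000)
points_b = frame_to_points(sine_frame(1000.0, 44100, 2048), 44100)

# The strongest point sits near 1 kHz in both sets, despite different sizes.
peak_a = points_a[np.argmax(points_a[:, 1]), 0]
peak_b = points_b[np.argmax(points_b[:, 1]), 0]

# Subsampling yields a smaller set in the same (frequency, magnitude) space.
rng = np.random.default_rng(0)
subset_b = subsample_points(points_b, 64, rng)
```

A model that consumes unordered sets of such points, rather than a fixed-length vector, would accept `points_a`, `points_b`, and `subset_b` alike; which point cloud architecture and which per-point features are actually used is not specified here.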