General-purpose embeddings are highly desirable for few-shot and even zero-shot learning in many application scenarios, including audio tasks. To better understand such representations, we conducted a thorough error analysis and visualization of HEAR 2021 submission results. Inspired by this analysis, this work experiments with different front-end audio preprocessing methods, including the Constant-Q Transform (CQT) and the Short-Time Fourier Transform (STFT), and proposes a Batch Embedding Covariance Regularization (BECR) term to uncover a more holistic simulation of the frequency information received by the human auditory system. We tested the models on the suite of HEAR 2021 tasks, which spans a broad range of task categories. Preliminary results show that (1) the proposed BECR yields more dispersed embeddings on the test set, (2) BECR improves the PaSST model without extra computational complexity, and (3) STFT preprocessing outperforms CQT on all tasks we tested. GitHub: https://github.com/ankitshah009/general_audio_embedding_hear_2021
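The abstract does not spell out the BECR formulation, but a batch-covariance regularizer of this flavor is commonly implemented by penalizing off-diagonal entries of the embedding covariance matrix so that dimensions decorrelate and embeddings spread out. The sketch below is a hypothetical illustration under that assumption, not the paper's exact loss term:

```python
import numpy as np

def becr_penalty(embeddings: np.ndarray) -> float:
    """Hypothetical BECR-style penalty (assumed form, may differ from the paper).

    Penalizes squared off-diagonal covariance of a batch of embeddings,
    which encourages decorrelated, more dispersed embedding dimensions.
    `embeddings` has shape [batch_size, embedding_dim].
    """
    n, d = embeddings.shape
    # Center each embedding dimension over the batch.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Unbiased batch covariance matrix, shape [d, d].
    cov = centered.T @ centered / (n - 1)
    # Zero out the diagonal; only cross-dimension correlations are penalized.
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)
```

In training, a term like this would be added to the task loss with a small weight, so redundancy across embedding dimensions is discouraged without changing the model architecture or adding inference-time cost.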