In this paper, we propose an efficient MLP-based approach for learning audio representations, namely timestamp and scene-level audio embeddings. We use an encoder consisting of sequentially stacked gated MLP blocks that accept 2D MFCCs as input. In addition, we provide a simple temporal-interpolation-based algorithm for computing scene-level embeddings from timestamp embeddings. The audio representations produced by our method are evaluated across the diverse set of benchmarks in the Holistic Evaluation of Audio Representations (HEAR) challenge, hosted at the NeurIPS 2021 competition track. We achieved first place on the Speech Commands (full), Speech Commands (5 hours), and Mridangam Tonic benchmarks. Furthermore, our approach is the most resource-efficient among all submitted methods, in terms of both the number of model parameters and the time required to compute embeddings.
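The two ingredients named above, a gated MLP block over MFCC frames and temporal interpolation from timestamp to scene-level embeddings, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the exact block layout (expansion size, gating scheme, normalization) and the interpolation/pooling details are assumptions, loosely following the gMLP-style design that "gated MLP" suggests.

```python
import numpy as np

def gated_mlp_block(x, w_in, w_gate, w_out):
    """One gated MLP block (hedged gMLP-style sketch).
    x: (T, d) array of MFCC frames (T time steps, d coefficients).
    The hidden features are split into value/gate halves; the gate half
    is mixed across the time axis before multiplicative gating."""
    h = np.maximum(x @ w_in, 0.0)        # (T, 2h) expansion + ReLU
    v, g = np.split(h, 2, axis=-1)       # value and gate halves, (T, h) each
    g = w_gate @ g                       # temporal mixing of the gate, (T, h)
    return (v * g) @ w_out + x           # project back, residual connection

def scene_embedding(timestamp_emb, n_frames=16):
    """Temporal-interpolation pooling (sketch): resample the timestamp
    embeddings to a fixed number of frames, then average them into a
    single scene-level vector."""
    T, d = timestamp_emb.shape
    src = np.linspace(0, T - 1, n_frames)
    resampled = np.stack(
        [np.interp(src, np.arange(T), timestamp_emb[:, j]) for j in range(d)],
        axis=-1)                         # (n_frames, d)
    return resampled.mean(axis=0)        # (d,) scene-level embedding

# Toy usage with random weights (shapes only; no trained model implied).
rng = np.random.default_rng(0)
T, d, hid = 32, 20, 64                   # 32 frames of 20 MFCCs
x = rng.normal(size=(T, d))
frames = gated_mlp_block(
    x,
    rng.normal(size=(d, 2 * hid)) * 0.1,   # input projection
    rng.normal(size=(T, T)) * 0.1,         # temporal gating weights
    rng.normal(size=(hid, d)) * 0.1)       # output projection
scene = scene_embedding(frames)            # shape (d,)
```

Here the timestamp embeddings are simply the per-frame outputs of the stacked blocks; resampling to a fixed `n_frames` before averaging makes the scene-level embedding insensitive to clip length.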