What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite spanning a variety of domains, including speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models from thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models, and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It remains an open question whether a single general-purpose audio representation can perform as holistically as the human ear.
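The common API mentioned above standardizes how every submitted model is loaded and queried; the HEAR API defines `load_model`, `get_scene_embeddings` (one embedding per clip), and `get_timestamp_embeddings` (one embedding per frame, with timestamps). The sketch below illustrates that shape only: real submissions operate on torch tensors and load trained weights, whereas the toy RMS featurizer, NumPy arrays, hop size, and timestamp convention here are illustrative assumptions.

```python
# Hedged sketch of the HEAR common API shape. Function names follow the
# published HEAR API; the toy RMS "embedding", NumPy arrays, and the
# 50 ms hop are assumptions made for a dependency-light illustration.
import numpy as np


class ToyModel:
    sample_rate = 16000           # audio sample rate the model expects
    scene_embedding_size = 8      # dimensionality of per-clip embeddings
    timestamp_embedding_size = 8  # dimensionality of per-frame embeddings


def load_model(model_file_path: str = "") -> ToyModel:
    """Return a model object; a real entry would load weights from disk."""
    return ToyModel()


def get_scene_embeddings(audio: np.ndarray, model: ToyModel) -> np.ndarray:
    """Map (n_sounds, n_samples) audio to (n_sounds, scene_embedding_size).

    Toy featurizer: split each clip into equal-ish chunks and take the
    RMS energy of each chunk as one embedding dimension.
    """
    chunks = np.array_split(audio, model.scene_embedding_size, axis=1)
    return np.stack([np.sqrt((c ** 2).mean(axis=1)) for c in chunks], axis=1)


def get_timestamp_embeddings(audio: np.ndarray, model: ToyModel,
                             hop_ms: float = 50.0):
    """Frame the audio and embed each frame.

    Returns (embeddings, timestamps):
      embeddings -- (n_sounds, n_frames, timestamp_embedding_size)
      timestamps -- (n_sounds, n_frames), frame centers in milliseconds
    """
    n_sounds, n_samples = audio.shape
    hop = int(model.sample_rate * hop_ms / 1000)
    n_frames = n_samples // hop
    emb = np.stack(
        [get_scene_embeddings(audio[:, i * hop:(i + 1) * hop], model)
         for i in range(n_frames)],
        axis=1,
    )
    ts = np.tile(np.arange(n_frames) * hop_ms + hop_ms / 2, (n_sounds, 1))
    return emb, ts
```

Because every submission exposes these same entry points, the HEAR evaluation harness can run each model on all nineteen tasks without any model-specific glue code, which is what makes the longitudinal comparisons in the paper possible.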