Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet truly understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are insufficient to assess these models' ability to match complex contexts across the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.