Audio captioning aims to describe the content of audio clips in natural language. Due to the inherent ambiguity of audio, different people may perceive the same clip differently, resulting in caption disparities (i.e., one audio clip may correspond to several captions with diverse semantics). General audio captioning models handle this one-to-many mapping during training by randomly selecting one of the correlated captions as the ground truth for each audio clip. However, this causes significant variation in the optimization direction and weakens model stability. To eliminate this negative effect, in this paper we propose a two-stage framework for audio captioning: (i) in the first stage, we use contrastive learning to construct a proxy feature space that reduces the distances between captions correlated with the same audio, and (ii) in the second stage, the proxy feature space serves as additional supervision, encouraging the model to be optimized in a direction that benefits all the correlated captions. We conducted extensive experiments on two datasets using four commonly used encoder and decoder architectures. Experimental results demonstrate the effectiveness of the proposed method. The code is available at https://github.com/PRIS-CV/Caption-Feature-Space-Regularization.
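As a rough illustration of the two-stage idea, the sketch below pairs a supervised-contrastive loss over caption embeddings (stage one) with a feature-space regularizer added to the captioning cross-entropy (stage two). This is our own minimal sketch, not the authors' released implementation (see the repository above); the function names, the temperature, and the weight `alpha` are illustrative assumptions.

```python
# Minimal PyTorch sketch of the two-stage idea (illustrative only; names,
# temperature, and alpha are assumptions, not the paper's released code).
import torch
import torch.nn.functional as F

def supcon_loss(caption_feats, audio_ids, temperature=0.07):
    """Stage 1: pull together embeddings of captions that describe the same
    audio (same audio_id) and push apart captions of different audios."""
    z = F.normalize(caption_feats, dim=-1)                     # (B, D)
    sim = z @ z.t() / temperature                              # (B, B)
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (audio_ids.unsqueeze(0) == audio_ids.unsqueeze(1)) & ~self_mask
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                     # anchors with >=1 positive
    loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

def stage2_loss(logits, targets, model_feats, proxy_feats, alpha=1.0):
    """Stage 2: standard captioning cross-entropy plus a regularizer that
    pulls the model's caption feature toward the frozen proxy feature."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets)      # (B, T, V) vs (B, T)
    reg = 1.0 - F.cosine_similarity(model_feats, proxy_feats, dim=-1).mean()
    return ce + alpha * reg
```

In this reading, `proxy_feats` would come from the frozen stage-one text encoder, so the regularizer's gradient points toward a region of the feature space close to all captions correlated with the same audio rather than toward one randomly sampled caption.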