Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. The task has received increasing attention with the release of freely available datasets in recent years, and it has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, including investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, all of which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, covering existing approaches, evaluation metrics, and datasets. Moreover, we discuss open challenges and envisage possible future research directions.