Automated audio captioning, a task that mimics human perception and bridges audio processing and natural language processing, has seen significant progress over the past few years. Audio captioning requires recognizing the acoustic scene, the primary audio events, and sometimes the spatial and temporal relationships between events in an audio clip. It also requires describing these elements in a fluent and vivid sentence. Deep learning-based approaches are widely adopted to tackle this problem. This paper presents a comprehensive review of the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.