Audio captioning is the task of generating a description of an audio clip based on its content. Because the task is highly complex, pre-trained models are widely used in audio captioning systems. Without retraining the complete system, however, it is hard to determine how much a given pre-trained model contributes to captioning performance. To avoid the time and energy cost of retraining, a predictor of how well a pre-trained model will perform in audio captioning is needed. In this paper, a series of pre-trained models is investigated for the correlation between their extracted audio features and audio captioning performance. A couple of predictors are proposed based on the experimental results. The results demonstrate that the kurtosis and skewness of the extracted audio features may serve as indicators of the performance of audio captioning systems built on pre-trained audio models, because these two statistics correlate highly with captioning performance.
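As a rough illustration of the idea in the abstract, the sketch below computes the skewness and kurtosis of a flattened feature vector and the Pearson correlation between such a statistic and a captioning score. It is only a sketch under stated assumptions: the abstract does not say which estimators, captioning metric, or correlation measure the authors use, so the population moment formulas, the function names, and the toy numbers here are all hypothetical placeholders, not the paper's method or data.

```python
import math
from typing import Sequence


def central_moment(values: Sequence[float], order: int) -> float:
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values: Sequence[float]) -> float:
    """Population skewness: m3 / m2**1.5 (0 for a symmetric distribution)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5


def kurtosis(values: Sequence[float]) -> float:
    """Population (non-excess) kurtosis: m4 / m2**2 (3 for a Gaussian)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical placeholders: one flattened feature vector per pre-trained
# model, and one captioning score per model. These are NOT values from
# the paper; they only show how the pieces fit together.
toy_features = {
    "model_a": [0.1, 0.2, 0.2, 0.3, 1.5],
    "model_b": [0.4, 0.5, 0.5, 0.6, 0.9],
    "model_c": [0.2, 0.4, 0.6, 0.8, 1.0],
}
toy_scores = {"model_a": 0.30, "model_b": 0.25, "model_c": 0.20}

names = sorted(toy_features)
skews = [skewness(toy_features[n]) for n in names]
kurts = [kurtosis(toy_features[n]) for n in names]
scores = [toy_scores[n] for n in names]

print("corr(skewness, score):", pearson(skews, scores))
print("corr(kurtosis, score):", pearson(kurts, scores))
```

With real data, each feature vector would come from running a pre-trained encoder over the evaluation audio, and each score from a fully trained captioning system. A high absolute correlation across many models would support using the statistic as a cheap performance indicator without retraining.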