Audio question answering (AQA) is a multimodal translation task in which a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Of the six questions for each audio file, two are designed to have 'yes' as the answer, two to have 'no', and the remaining two to have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task: an LSTM-based multimodal binary classifier for 'yes' or 'no' type answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7%, and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
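The top-1 and top-5 figures reported for the multi-class classifier can be read with the usual definition of top-k accuracy: a prediction counts as correct when the reference answer is among the k highest-scoring classes. The abstract does not state how the metric was computed, so the following is only a minimal sketch of that standard definition; the function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the Clotho-AQA baselines.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of examples whose true label is among the k top-scoring classes.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 questions and 3 candidate answer classes
# (the baseline described above uses 828 classes).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

With k = 1 this reduces to ordinary classification accuracy, which is why top-5 accuracy can never be lower than top-1 accuracy for the same model.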