Convolutional neural networks (CNNs) with log-mel spectrum features have shown promising results for acoustic scene classification tasks. However, the performance of these CNN-based classifiers still falls short, as they do not generalise well to unknown environments. To address this issue, we introduce an acoustic spectrum transformation network in which traditional log-mel spectrums are transformed into imagined visual features (IVF). The imagined visual features are learned by exploiting the relationship between the audio and visual features present in video recordings: an auto-encoder encodes images as visual features, and a transformation network learns to generate imagined visual features from log-mel spectrums. Our model is trained on a large dataset of YouTube videos. We evaluate the proposed method on the scene classification tasks of DCASE and ESC-50, where it outperforms other spectrum features, especially in unseen environments.
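To make the two-network setup concrete, below is a minimal PyTorch sketch of the training loop described above: an image auto-encoder whose bottleneck serves as the visual feature, and a transformation network that regresses an imagined visual feature (IVF) from a paired log-mel spectrum. All layer shapes, module names (`ImageAutoEncoder`, `SpectrumTransformNet`), loss weighting, and input sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ImageAutoEncoder(nn.Module):
    """Encodes video frames into a compact bottleneck vector and
    reconstructs them; the bottleneck is used as the visual feature."""
    def __init__(self, feat_dim=128):  # feat_dim is an assumed size
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, feat_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # visual feature
        return self.decoder(z), z

class SpectrumTransformNet(nn.Module):
    """Maps a log-mel spectrum (1 x mels x frames) to an imagined
    visual feature matching the auto-encoder bottleneck."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, feat_dim),
        )

    def forward(self, logmel):
        return self.net(logmel)

# One joint training step on paired (frame, log-mel) data from video clips.
autoenc = ImageAutoEncoder()
transform_net = SpectrumTransformNet()
opt = torch.optim.Adam(
    list(autoenc.parameters()) + list(transform_net.parameters()), lr=1e-4
)

frames = torch.rand(8, 3, 64, 64)    # dummy video frames
logmels = torch.rand(8, 1, 64, 64)   # dummy paired log-mel spectrums

recon, visual_feat = autoenc(frames)
ivf = transform_net(logmels)
# Reconstruction loss trains the auto-encoder; an L2 regression loss
# pulls the IVF towards the (detached) visual feature target.
loss = (nn.functional.mse_loss(recon, frames)
        + nn.functional.mse_loss(ivf, visual_feat.detach()))
opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch the visual-feature target is detached so the regression loss shapes only the transformation network; whether the two networks are trained jointly or in stages is a design choice not specified here. At test time, only `SpectrumTransformNet` is needed to produce IVFs from audio for the downstream scene classifier.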