Cross-modal Spectrum Transformation Network for Acoustic Scene Classification

Convolutional neural networks (CNNs) with log-mel spectrum features have shown promising results for acoustic scene classification tasks. However, the performance of these CNN-based classifiers is still lacking, as they do not generalise well to unknown environments. To address this issue, we introduce an acoustic spectrum transformation network in which traditional log-mel spectra are transformed into imagined visual features (IVF). The imagined visual features are learned by exploiting the relationship between the audio and visual features present in video recordings. An auto-encoder is used to encode images as visual features, and a transformation network learns how to generate imagined visual features from log-mel spectra. Our model is trained on a large dataset of YouTube videos. We test our proposed method on the scene classification tasks of DCASE and ESC-50, where our method outperforms other spectrum features, especially for unseen environments.
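The abstract takes log-mel spectra as the input to the transformation network but does not say how they are computed. As a minimal sketch of that input feature only, the following pure-Python code builds a triangular mel filterbank and applies it to one power-spectrum frame. The HTK-style mel formula, the function names, and the parameters (40 mel bands, 1024-point FFT, 16 kHz sampling) are assumptions chosen for illustration, not values taken from the paper. The auto-encoder and the transformation network are not shown.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel scale (one common convention; assumed here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over n_fft // 2 + 1 frequency bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                weight = (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                weight = (hi - f) / (hi - mid)   # falling slope
            else:
                weight = 0.0
            row.append(weight)
        bank.append(row)
    return bank


def log_mel_frame(power_spectrum, bank, eps=1e-10):
    """Apply the filterbank to one power-spectrum frame and take the log."""
    return [
        math.log(sum(w * p for w, p in zip(row, power_spectrum)) + eps)
        for row in bank
    ]


# Illustrative parameters (hypothetical, not from the paper).
bank = mel_filterbank(n_mels=40, n_fft=1024, sample_rate=16000)
frame = log_mel_frame([1.0] * 513, bank)  # flat spectrum as a stand-in input
```

Stacking such frames over time gives the two-dimensional log-mel spectrum that a CNN classifier, or a transformation network like the one described above, would take as input.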

