Acoustic scene classification is the task of characterizing and classifying environments from sound recordings. The first step is to extract features (representations) from the recorded sound; these features are then used to classify the background environment. However, the choice of representation has a dramatic effect on classification accuracy. In this paper, we explore the effect of three such representations on classification accuracy using neural networks. We investigate spectrograms, MFCCs, and learned embeddings, evaluated with different CNN architectures and autoencoders. Our dataset consists of sounds from three indoor and three outdoor settings, giving six different kinds of environments in total. We found that the spectrogram representation yields the highest classification accuracy, while MFCCs yield the lowest. We report our findings and insights, along with guidelines for achieving better accuracy in sound-based environment classification.
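To make the two hand-crafted representations concrete, the sketch below computes a magnitude spectrogram and MFCC-style coefficients from a short signal. This is an illustrative assumption, not the paper's pipeline: the frame sizes are arbitrary, and the mel filterbank step of a full MFCC computation is omitted for brevity (the DCT is applied directly to the log spectrum).

```python
import numpy as np

def stft_mag(y, n_fft=512, hop=256):
    # Frame the signal, apply a Hann window, and take the magnitude FFT
    # of each frame; transposed so rows are frequency bins, columns time.
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def mfcc_like(spec, n_coef=13):
    # Log-compress the spectrum and decorrelate with a DCT-II basis,
    # as in the classic MFCC recipe (mel filterbank omitted here).
    log_spec = np.log(spec + 1e-8)
    n = log_spec.shape[0]
    k = np.arange(n_coef)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return basis @ log_spec  # (n_coef, time)

sr = 16000
y = np.random.default_rng(0).standard_normal(sr)  # 1 s of synthetic audio
spec = stft_mag(y)       # 2-D time-frequency image, a typical CNN input
mfcc = mfcc_like(spec)   # compact cepstral summary of the same frames
print(spec.shape, mfcc.shape)
```

The spectrogram keeps the full time-frequency image, while the cepstral coefficients compress each frame to a handful of numbers, which is one intuition for why the two representations can lead to different classification accuracy.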