Machine hearing of environmental sounds is an important problem in the audio recognition domain. It gives a machine the ability to discriminate between different input sounds, which guides its decision making. In this work we exploit a self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations. We generate representations of a given audio clip from both its raw waveform and its spectrogram, and evaluate whether the proposed learner is agnostic to the type of audio input. We further use canonical correlation analysis (CCA) to fuse the representations from the two input types and demonstrate that the fused global feature yields a more robust representation of the audio signal than either individual representation. The proposed technique is evaluated on both the ESC-50 and UrbanSound8K datasets. The results show that it extracts most of the features of environmental audio and yields improvements of 12.8% and 0.9% on ESC-50 and UrbanSound8K, respectively.
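To make the self-supervised contrastive objective concrete, the following is a minimal NumPy sketch of an NT-Xent-style contrastive loss over paired embeddings of the same clips (e.g. one view per input type). The function name, batch shapes, and temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent contrastive loss (illustrative sketch).

    z1[i] and z2[i] are embeddings of two views of the same clip
    (a positive pair); all other pairs in the batch act as negatives.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)            # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau                              # scaled similarities

    # Index of each sample's positive partner: i <-> i + n.
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(n)])

    # Exclude self-similarity from the softmax denominator.
    mask = ~np.eye(2 * n, dtype=bool)
    logits = np.where(mask, sim, -np.inf)

    # Cross-entropy of each sample against its positive partner.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos_idx].mean()
```

Minimizing this loss pulls the two views of each clip together while pushing apart embeddings of different clips, which is the mechanism the abstract relies on to learn representations without labels.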