Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events occurring in the scene, even though they provide crucial semantic information. This paper conducts the first study to investigate whether real-life acoustic scenes can be reliably recognized based only on features that describe a limited number of audio events. To model the task-specific relationships between coarse-grained acoustic scenes and fine-grained audio events, we propose an event relational graph representation learning (ERGL) framework for ASC. Specifically, ERGL learns a graph representation of an acoustic scene from the input audio, where the embedding of each event is treated as a node, while the relationship cues between each pair of event embeddings are described by a learned multi-dimensional edge feature. Experiments on a polyphonic acoustic scene dataset show that the proposed ERGL achieves competitive ASC performance using only a limited number of audio event embeddings and no data augmentation. The validity of the proposed ERGL framework demonstrates the feasibility of recognizing diverse acoustic scenes based on the event relational graph. Our code is available on the project homepage (https://github.com/Yuanbo2020/ERGL).
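To make the graph construction concrete, the sketch below shows one plausible way to realize the idea stated above: each audio event embedding becomes a node, and a learned multi-dimensional edge feature is derived from every pair of node embeddings. This is a minimal PyTorch illustration, not the authors' implementation; all names (EventRelationalGraph, event_proj, edge_mlp) and dimensions are assumptions.

```python
# Minimal sketch of event relational graph construction (assumed design,
# not the authors' code): nodes = event embeddings, edges = learned
# multi-dimensional features computed from each pair of node embeddings.
import torch
import torch.nn as nn

class EventRelationalGraph(nn.Module):
    def __init__(self, embed_dim: int, edge_dim: int):
        super().__init__()
        # Projection producing node features from raw event embeddings
        self.event_proj = nn.Linear(embed_dim, embed_dim)
        # Edge encoder: maps a concatenated embedding pair to an edge feature
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, edge_dim),
            nn.ReLU(),
        )

    def forward(self, event_embeddings: torch.Tensor):
        # event_embeddings: (batch, num_events, embed_dim)
        nodes = self.event_proj(event_embeddings)
        b, n, d = nodes.shape
        # Pair every node with every other node -> (batch, n, n, 2*d)
        src = nodes.unsqueeze(2).expand(b, n, n, d)
        dst = nodes.unsqueeze(1).expand(b, n, n, d)
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))
        return nodes, edges  # (b, n, d), (b, n, n, edge_dim)

# Usage: 25 hypothetical event embeddings of dimension 128, 64-dim edges
graph = EventRelationalGraph(embed_dim=128, edge_dim=64)
nodes, edges = graph(torch.randn(4, 25, 128))
print(nodes.shape, edges.shape)  # (4, 25, 128), (4, 25, 25, 64)
```

The resulting node and edge tensors could then be fed to any graph neural network that supports edge features to produce the scene-level prediction.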