Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events occurring in the scene, even though these events provide crucial semantic information. This paper presents the first study investigating whether real-life acoustic scenes can be reliably recognized based only on features that describe a limited number of audio events. To model the task-specific relationships between coarse-grained acoustic scenes and fine-grained audio events, we propose an event relational graph representation learning (ERGL) framework for ASC. Specifically, ERGL learns a graph representation of an acoustic scene from the input audio, where the embedding of each event is treated as a node, while the relationship cues derived from each pair of event embeddings are described by a learned multidimensional edge feature. Experiments on a polyphonic acoustic scene dataset show that the proposed ERGL achieves competitive performance on ASC using only a limited number of audio event embeddings, without any data augmentation. The validity of the proposed ERGL framework demonstrates the feasibility of recognizing diverse acoustic scenes based on the event relational graph. Our code is available on our homepage (https://github.com/Yuanbo2020/ERGL).
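To make the graph construction described above concrete, the following is a minimal PyTorch sketch, not the authors' reference implementation: it assumes a pooled spectrogram feature vector as input, derives one node embedding per audio event, and passes each ordered pair of node embeddings through a small MLP to obtain a multidimensional edge feature. All module names, dimensions, and the per-event projection design are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EventRelationalGraph(nn.Module):
    """Illustrative sketch of an event relational graph: nodes are audio-event
    embeddings, edges are learned multidimensional features from node pairs.
    Hypothetical design, not the released ERGL code."""

    def __init__(self, num_events: int, embed_dim: int, edge_dim: int):
        super().__init__()
        # One learnable projection per event to obtain its node embedding
        # from a shared audio feature vector (assumed design choice).
        self.node_proj = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(num_events)]
        )
        # Edge MLP maps a concatenated pair of node embeddings to a
        # multidimensional edge feature.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, edge_dim),
            nn.ReLU(),
            nn.Linear(edge_dim, edge_dim),
        )

    def forward(self, audio_feat: torch.Tensor):
        # audio_feat: (batch, embed_dim) pooled spectrogram representation
        nodes = torch.stack(
            [proj(audio_feat) for proj in self.node_proj], dim=1
        )  # (batch, num_events, embed_dim)
        b, n, d = nodes.shape
        src = nodes.unsqueeze(2).expand(b, n, n, d)  # sender of each edge
        dst = nodes.unsqueeze(1).expand(b, n, n, d)  # receiver of each edge
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))
        # edges: (batch, num_events, num_events, edge_dim)
        return nodes, edges
```

In such a setup, a downstream graph neural network or classifier would operate on the returned node and edge tensors to predict the acoustic scene label; the exact readout used by ERGL is described in the paper and released code.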