Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Many existing multimodal approaches process each modality with a modality-specific model and then fuse the resulting embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and to represent detailed information about the underlying signal. We apply heterogeneous graph approaches to the task of visually-aware acoustic event classification, which provides a compact, efficient, and scalable way to represent the data in the form of graphs. Through heterogeneous graphs, we show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales. Our model can easily be adapted to events of different scales through the relevant hyperparameters. Experiments on AudioSet, a large benchmark, show that our model achieves state-of-the-art performance.
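To make the graph construction concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of an audio-visual heterogeneous graph using PyTorch Geometric's `HeteroData`. The node types, edge types, feature dimensions, and segment count `T` are assumptions chosen for illustration only.

```python
# Illustrative sketch: a heterogeneous audio-visual graph in PyTorch Geometric.
# Node types hold per-segment audio and per-frame visual embeddings; edge types
# encode intra-modal temporal links and inter-modal (audio <-> video) links.
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import HeteroConv, SAGEConv

T = 10                                   # number of temporal segments (hypothetical)
data = HeteroData()
data['audio'].x = torch.randn(T, 128)    # hypothetical audio segment features
data['video'].x = torch.randn(T, 512)    # hypothetical video frame features

# Temporal edges within each modality: segment t -> segment t+1.
t_src = torch.arange(T - 1)
temporal = torch.stack([t_src, t_src + 1])
data['audio', 'follows', 'audio'].edge_index = temporal
data['video', 'follows', 'video'].edge_index = temporal

# Inter-modal edges: co-occurring audio and video segments are connected.
co = torch.stack([torch.arange(T), torch.arange(T)])
data['audio', 'coincides', 'video'].edge_index = co
data['video', 'coincides', 'audio'].edge_index = co

# One relation-aware message-passing layer over the heterogeneous graph.
conv = HeteroConv({
    ('audio', 'follows', 'audio'): SAGEConv((-1, -1), 64),
    ('video', 'follows', 'video'): SAGEConv((-1, -1), 64),
    ('audio', 'coincides', 'video'): SAGEConv((-1, -1), 64),
    ('video', 'coincides', 'audio'): SAGEConv((-1, -1), 64),
}, aggr='sum')

out = conv(data.x_dict, data.edge_index_dict)  # dict of updated per-type node features
```

In such a setup, the temporal span of the edges (here a single-step `follows` relation) is one of the hyperparameters that could be varied to adapt the graph to events of different scales.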