Audio-visual scene understanding is a challenging problem because of the unstructured spatio-temporal relations in audio signals and the varied spatial layouts of objects and texture patterns in visual images. Many recent studies have focused on abstracting features from convolutional neural networks, while learning explicit, semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the sound spectrogram and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information from the input features, we use an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to represent both the salient regions and the contextual information of the audio-visual inputs, we construct four graphs for the scene representation: the salient acoustic graph (SAG), the contextual acoustic graph (CAG), the salient visual graph (SVG), and the contextual visual graph (CVG). Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experiments on audio, visual, and audio-visual scene recognition datasets show that AGCN achieves promising results. Visualizations of the graphs on spectrograms and images show that the proposed CAG/SAG and CVG/SVG focus on salient and semantically relevant regions.
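The graph-based stage described above can be illustrated with a minimal NumPy sketch. It is not the AGCN implementation: the abstract does not specify how nodes are selected or how edges are defined, so the top-k selection by mean activation, the cosine-similarity adjacency, the single symmetric-normalized graph convolution, and the mean-pooling readout are all assumptions made for illustration. The function names (`salient_nodes`, `similarity_adjacency`, `gcn_layer`) are hypothetical, and the random array stands in for a backbone feature map of a spectrogram or image.

```python
import numpy as np


def salient_nodes(feature_map, k):
    """Pick the k spatial positions with the largest mean activation (assumed saliency rule).

    feature_map has shape (C, H, W); the result has shape (k, C).
    """
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w).T      # (H*W, C): one row per position
    saliency = flat.mean(axis=1)                # (H*W,): mean over channels
    idx = np.argsort(-saliency)[:k]             # indices of the k largest scores
    return flat[idx]


def similarity_adjacency(nodes):
    """Cosine-similarity adjacency with self-loops; negative similarities are clipped to 0."""
    norms = np.linalg.norm(nodes, axis=1, keepdims=True)
    unit = nodes / np.maximum(norms, 1e-12)     # guard against zero-norm rows
    adj = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(adj, 1.0)                  # self-loops keep every degree >= 1
    return adj


def gcn_layer(nodes, adj, weight):
    """One graph convolution: ReLU(D^{-1/2} A D^{-1/2} X W)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ nodes @ weight, 0.0)


# Stand-in for a backbone feature map (16 channels on an 8x8 grid).
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8, 8))

nodes = salient_nodes(features, k=5)            # (5, 16) node features
adj = similarity_adjacency(nodes)               # (5, 5) graph over the selected positions
weight = rng.standard_normal((16, 4))           # untrained weights, for shape checking only
out = gcn_layer(nodes, adj, weight)             # (5, 4) updated node features
scene_embedding = out.mean(axis=0)              # (4,) graph-level readout
```

In this sketch a contextual graph could be built the same way from a different node-selection rule, and the acoustic and visual branches would differ only in the feature map passed in; how the paper actually distinguishes salient from contextual graphs is not stated in the abstract.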