How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space.
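To make the described pipeline concrete, below is a minimal PyTorch-style sketch of what "aligning audio to the visual latent space" could look like. This is not the paper's implementation: the module names (AudioEncoder, pretrained_generator) and the contrastive alignment objective are illustrative assumptions about one plausible way to realize the audio-to-visual translation step.

```python
# Minimal sketch (not the authors' code) of audio-to-visual latent alignment.
# Assumptions: a frozen pre-trained image generator exposes a latent space of
# size latent_dim, and paired image latents are available during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram to a vector in the generator's latent space."""
    def __init__(self, n_mels=128, latent_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, mel):                 # mel: (B, 1, n_mels, T)
        h = self.conv(mel).flatten(1)       # (B, 64)
        return self.proj(h)                 # (B, latent_dim)

def alignment_loss(audio_latent, image_latent, temperature=0.07):
    """Contrastive (InfoNCE-style) loss pulling paired audio/image latents
    together; one plausible choice of alignment objective."""
    a = F.normalize(audio_latent, dim=-1)
    v = F.normalize(image_latent, dim=-1)
    logits = a @ v.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Inference (assumed interface): translate audio to a visual latent, then
# decode it with the frozen pre-trained generator.
# z = audio_encoder(mel)                    # audio -> visual latent
# image = pretrained_generator(z)           # latent -> RGB image
```

A sound-source-localization score on each training clip could then be used to keep only the audio-visual pairs with strong cross-modal correlation, as the abstract describes, before computing the alignment loss.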