From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn to modify input images such that, after manipulation, they are more likely to co-occur with a given input sound. In quantitative and qualitative evaluations, our sound-based model outperforms label-based approaches. We also show that audio can be an intuitive representation for manipulating images, as adjusting a sound's volume or mixing two sounds together results in predictable changes to visual style. Project webpage: https://tinglok.netlify.app/files/avstyle