Video-to-sound generation aims to generate realistic and natural sound given a video input. However, previous video-to-sound generation methods can only produce sound with a random or average timbre, offering no control over or specialization of the generated timbre, so users sometimes cannot obtain the timbre they desire. In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample. To solve this task, we disentangle each target audio into three components: temporal information, acoustic information, and background information. We first use three encoders to encode these components respectively: 1) a temporal encoder to encode temporal information, which is fed with video frames since the input video shares the same temporal information as the original audio; 2) an acoustic encoder to encode timbre information, which takes the original audio as input and discards its temporal information via a temporal-corrupting operation; and 3) a background encoder to encode the residual or background sound, which takes the background part of the original audio as input. To improve both the quality and the temporal alignment of the generated result, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Our experimental results on the VAS dataset demonstrate that our method can generate high-quality audio samples with good synchronization with the events in the video and high timbre similarity to the reference audio.
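The abstract does not specify how the temporal-corrupting operation is implemented. One plausible instantiation, sketched below under that assumption, is a random permutation of the time frames of a mel spectrogram: the per-frame spectral content (which carries timbre) is preserved, while the event timing is destroyed, so the acoustic encoder cannot leak temporal information. The function name and the frame-shuffling choice are illustrative, not taken from the paper.

```python
import numpy as np

def temporal_corrupt(mel, seed=None):
    """Destroy temporal order while preserving frame-level (timbre) content.

    mel: array of shape (n_mels, n_frames), a mel spectrogram.
    Returns a copy whose time frames are randomly permuted. An encoder
    reading the result can still learn acoustic/timbre information, but
    not the event timing, which the video-fed temporal encoder supplies.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(mel.shape[1])  # random reordering of frame indices
    return mel[:, perm]

# toy example: 80 mel bins, 16 time frames
mel = np.arange(80 * 16, dtype=float).reshape(80, 16)
corrupted = temporal_corrupt(mel, seed=0)

# the multiset of frames is unchanged; only their order differs
assert sorted(map(tuple, corrupted.T.tolist())) == sorted(map(tuple, mel.T.tolist()))
```

In practice the permutation could instead shuffle short chunks of frames rather than single frames, trading off how much local acoustic structure the encoder sees against how much timing leaks through.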