A number of techniques have demonstrated GAN-based generation of multimedia data for one modality at a time, such as images, videos, or audio. However, the task of multi-modal data generation, specifically the joint generation of audio and video, has not yet been well explored. To this end, we propose a method that generates naturalistic video and audio samples through the joint, correlated generation of the two modalities. The proposed method uses multiple discriminators to ensure that the audio, the video, and their joint output are each indistinguishable from real-world samples. We present a dataset for this task and show that we can generate realistic samples. The method is validated using standard metrics such as the Inception Score and the Fréchet Inception Distance (FID), as well as through human evaluation.
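Below is a minimal sketch of how the multi-discriminator objective described above could be wired up, assuming a standard non-saturating GAN loss with separate audio, video, and joint (audio-video pair) discriminators sharing one generator. All module names, tensor shapes, and hyperparameters here are illustrative assumptions; the abstract does not specify the actual architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a shared latent code to correlated audio and video samples."""
    def __init__(self, z_dim=128, audio_dim=1024, video_dim=4096):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU())
        self.audio_head = nn.Linear(512, audio_dim)  # e.g. a waveform chunk
        self.video_head = nn.Linear(512, video_dim)  # e.g. flattened frames

    def forward(self, z):
        h = self.trunk(z)  # shared features keep the modalities correlated
        return self.audio_head(h), self.video_head(h)

def mlp_disc(in_dim):
    """Simple stand-in discriminator returning a real/fake logit."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
                         nn.Linear(256, 1))

G = Generator()
D_audio, D_video = mlp_disc(1024), mlp_disc(4096)
D_joint = mlp_disc(1024 + 4096)  # judges audio-video pairs jointly
bce = nn.BCEWithLogitsLoss()

def d_loss(real_a, real_v, fake_a, fake_v):
    """Each discriminator separately learns to tell real from generated."""
    total = 0.0
    for D, real, fake in [
        (D_audio, real_a, fake_a),
        (D_video, real_v, fake_v),
        (D_joint, torch.cat([real_a, real_v], 1),
                  torch.cat([fake_a, fake_v], 1)),
    ]:
        r, f = D(real), D(fake.detach())
        total = total + bce(r, torch.ones_like(r)) + bce(f, torch.zeros_like(f))
    return total

def g_loss(fake_a, fake_v):
    """The generator tries to fool all three discriminators at once."""
    logits = [D_audio(fake_a), D_video(fake_v),
              D_joint(torch.cat([fake_a, fake_v], 1))]
    return sum(bce(l, torch.ones_like(l)) for l in logits)

# One illustrative step with placeholder "real" data:
z = torch.randn(8, 128)
fa, fv = G(z)
ra, rv = torch.randn(8, 1024), torch.randn(8, 4096)
print(d_loss(ra, rv, fa, fv).item(), g_loss(fa, fv).item())
```

The joint discriminator is the piece that enforces cross-modal consistency: even if each modality looks realistic on its own, a mismatched audio-video pair can still be rejected.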