Recently, researchers have come to realize that, in some cases, self-supervised pre-training on large-scale Internet data can outperform pre-training on smaller, high-quality, manually labeled datasets, and that multimodal large models can outperform unimodal or bimodal small models. In this paper, we propose WavBriVL, a robust audio representation learning method built on Bridging-Vision-and-Language (BriVL). WavBriVL projects audio, images, and text into a shared embedding space, enabling multimodal applications. We present a qualitative evaluation of images generated from this shared embedding space, with two main goals: (1) learning the correlation between audio and images; (2) exploring a new approach to image generation, namely using audio to generate pictures. Experimental results show that this method can effectively generate appropriate images from audio.
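To make the shared-embedding idea concrete, the following is a minimal sketch of aligning an audio encoder with a fixed image embedding space via a symmetric contrastive loss, which is one common way such cross-modal alignment is trained. It is an illustrative assumption, not the paper's actual implementation: the module name AudioProjector, the dimensions, and the use of random stand-in features are all hypothetical.

```python
# Minimal sketch: align audio embeddings with a shared image-text embedding
# space via contrastive learning. All names, dimensions, and data here are
# illustrative assumptions, not the WavBriVL codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed size of the shared embedding space

class AudioProjector(nn.Module):
    """Hypothetical audio encoder mapping audio features into the shared space."""
    def __init__(self, in_dim=128, out_dim=EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, x):
        # Unit-normalize so similarities are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling paired audio/image embeddings together."""
    logits = audio_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # positives lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random stand-ins for audio features and frozen image embeddings.
audio_feats = torch.randn(8, 128)
image_emb = F.normalize(torch.randn(8, EMBED_DIM), dim=-1)
loss = contrastive_loss(AudioProjector()(audio_feats), image_emb)
loss.backward()
```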