Representing the same information in multiple modalities offers several perspectives on it, which can improve understanding. Thus, it may be crucial to generate data in a different modality from existing data. In this paper, we investigate the cross-modal audio-to-image generation problem and propose Cross-Modal Contrastive Representation Learning (CMCRL) to extract useful features from audio and use them in the generation phase. Experimental results show that CMCRL improves the quality of generated images compared with previous research.
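The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss that pulls each audio embedding toward its paired image embedding and pushes it away from the other images in the batch. The function names, the temperature value, and the use of plain Python lists are all assumptions made for illustration and should not be read as the CMCRL formulation itself.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    audio_emb[i] and image_emb[i] are assumed to come from the same sample.
    The temperature is a hypothetical default, not a value from the paper.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    image = [l2_normalize(v) for v in image_emb]
    # Cosine similarity between every audio and every image, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in image]
        for u in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-image and image-to-audio directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    paired = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"paired: {paired:.6f}, swapped: {swapped:.6f}")
```

Under this sketch, correctly paired embeddings yield a loss near zero, while swapped pairs yield a large loss. An audio encoder trained with such an objective would produce features aligned with image content, which a generator could then condition on.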