Cross-modal representation learning makes it possible to integrate information from different modalities into a single representation. At the same time, research on generative models tends to focus on the visual domain, with less emphasis on other domains such as audio or text, potentially missing the benefits of shared representations. Studies that successfully link more than one modality in the generative setting are rare. In this context, we investigate whether variational autoencoders (VAEs) can be trained to reconstruct image archetypes from audio data. Specifically, we consider VAEs in an adversarial training framework to encourage more variability in the generated data, and we find a trade-off between the consistency and diversity of the generated images: this trade-off can be governed by scaling the reconstruction loss up or down, respectively. Our results further suggest that even when the generated images are relatively inconsistent (diverse), features that are critical for correct image classification are preserved.
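The consistency-diversity trade-off described above can be pictured as a weighted sum of loss terms in a VAE-GAN-style objective. The following is a minimal PyTorch sketch, not the paper's exact objective: `recon_weight` is a hypothetical name for the reconstruction scaling factor, and the specific reconstruction, KL, and adversarial terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(x_recon, x_target, mu, logvar, d_fake, recon_weight=1.0):
    """Sketch of a generator objective for a VAE trained adversarially.

    Increasing `recon_weight` (hypothetical) emphasizes reconstruction,
    pushing outputs toward consistent image archetypes; decreasing it
    lets the adversarial term dominate, yielding more diverse outputs.
    """
    # Pixel-wise reconstruction term (favors consistency)
    recon = F.mse_loss(x_recon, x_target)
    # KL divergence of the approximate posterior q(z|x) from N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Non-saturating adversarial term on discriminator logits (favors diversity/realism)
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return recon_weight * recon + kl + adv
```

Under this reading, the trade-off reported in the abstract corresponds to sweeping `recon_weight` up (more consistent archetypes) or down (more diverse samples).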