The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations, such as spatial translation, have been applied. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA (Learning Multimodal Data Augmentation), an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.
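To make the core idea concrete, the snippet below is a minimal PyTorch sketch of where a learned feature-space augmenter could sit in a two-modality (image + text) pipeline. The encoder shapes, the additive `FeatureAugmenter`, the fusion head, and the joint loss are all illustrative assumptions for exposition; they are not the actual LeMDA architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative per-modality encoders (stand-ins for, e.g., an image backbone and a
# text embedding network); all layer sizes here are arbitrary assumptions.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
text_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
fusion_head = nn.Linear(256, 10)  # task head over the fused features (10 classes, assumed)

class FeatureAugmenter(nn.Module):
    """A small network that learns to perturb the joint (concatenated) feature vector.

    This only sketches the notion of learned feature-space augmentation; it is not the
    augmentation network or training objective proposed in the paper.
    """
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, features):
        return features + self.net(features)  # additive perturbation of the fused features

augmenter = FeatureAugmenter(256)

# Dummy batch: 8 images, 8 text embeddings, 8 class labels.
images = torch.randn(8, 3, 32, 32)
texts = torch.randn(8, 300)
labels = torch.randint(0, 10, (8,))

# Encode each modality, fuse, and augment jointly in feature space.
z = torch.cat([image_encoder(images), text_encoder(texts)], dim=-1)
z_aug = augmenter(z)

# Train the task network on both original and augmented features; how the augmenter
# itself is optimized (e.g., adversarial or consistency-based) is a separate design
# choice not shown here.
loss = F.cross_entropy(fusion_head(z), labels) + F.cross_entropy(fusion_head(z_aug), labels)
loss.backward()
```

Because the augmentation acts on latent features rather than raw inputs, the same mechanism applies unchanged to any combination of modalities, which is the property the abstract emphasizes.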