Deep learning has attracted growing interest in recent years. In particular, multimodal learning has shown great promise for solving a wide range of problems in domains such as language, vision, and audio. One promising research direction for further improvement is learning rich and robust low-dimensional representations of the high-dimensional world with the help of the large-scale datasets available on the internet. Because it avoids the cost of annotating large-scale datasets, self-supervised learning has become the de facto standard for this task in recent years. This paper summarizes some of the landmark research papers that are directly or indirectly responsible for building the foundation of multimodal self-supervised representation learning today. It traces the development of representation learning over the last few years for each modality and describes how these approaches were later combined to obtain multimodal agents.