Artificial Intelligence (AI) and its applications have attracted extraordinary interest in recent years. This interest can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a subfield of machine learning based on artificial neural networks, has driven the most rapid growth in these domains. As a result, the integration of vision and language has attracted considerable attention, and the tasks that combine them have been designed to exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and we discuss existing datasets, methods, problem formulations, and evaluation measures for Visual Question Answering (VQA) and visual reasoning tasks, with the aim of understanding vision and language representation learning. We also present some potential future directions in this field of research, in the hope that our study may generate new ideas and novel approaches to address existing difficulties and develop new applications.