Recent advances in visiolinguistic (VL) learning have enabled the development of numerous models and techniques capable of solving a variety of tasks that require the cooperation of vision and language. Current datasets used for VL pre-training contain only a limited amount of visual and linguistic knowledge, which significantly restricts the generalization capabilities of many VL models. External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) can cover these generalization gaps by filling in the missing knowledge, giving rise to hybrid architectures. In this survey, we analyze the tasks that have benefited from such hybrid approaches. Moreover, we categorize existing knowledge sources and types, and then discuss the KG vs. LLM dilemma and its potential impact on future hybrid approaches.