Deep learning techniques have led to remarkable breakthroughs in generic object detection and have given rise to many scene-understanding tasks in recent years. Scene graphs have become a focus of research because of their powerful semantic representation and their applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image into a semantically structured scene graph, which requires correctly labeling the detected objects and the relationships between them. Although this is a challenging task, the community has proposed many SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of the recent achievements that deep learning techniques have brought to this field. We review 138 representative works covering different input modalities, and we systematically summarize existing methods of image-based SGG from the perspective of feature extraction and fusion. We attempt to connect and systematize existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG comprehensively. Finally, we conclude this survey with an in-depth discussion of open problems and future research directions. We hope this survey helps readers develop a better understanding of the current state of research and its key ideas.