As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown not only to help compositional models learn to reason over these structures but also to make model decisions more interpretable. Nevertheless, such representations have received far less attention than traditional recognition tasks, leaving numerous open challenges unsolved. In this thesis, we study how machines can describe the content of an individual image or video using visual relationships as structured representations. Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both static-image and video settings, with improvements stemming from external knowledge incorporation, bias-reducing mechanisms, and enhanced representation models. We conclude the thesis with a discussion of open challenges and limitations to shed light on future directions for structured representation learning of visual scenes.