Exploring the semantic context in scene images is essential for indoor scene recognition. However, owing to diverse intra-class spatial layouts and coexisting inter-class objects, modeling contextual relationships that adapt to varied image characteristics remains a great challenge. Existing contextual modeling methods for indoor scene recognition exhibit two limitations: 1) during training, space-independent information, such as color, may hinder the optimization of the network's capacity to represent spatial context; 2) these methods often overlook the differences among coexisting objects across scenes, which suppresses scene recognition performance. To address these limitations, we propose SpaCoNet, a novel approach that simultaneously models the Spatial relations and Co-occurrence of objects based on semantic segmentation. First, the semantic spatial relation module (SSRM) is designed to explore the spatial relations among objects within a scene. Guided by semantic segmentation, this module decouples spatial information from the image, effectively avoiding the influence of irrelevant features. Second, both the spatial context features from the SSRM and the deep features from an RGB feature extractor are used to distinguish objects that co-occur across different scenes. Finally, using these discriminative features, we employ a self-attention mechanism to explore long-range co-occurrence relationships among objects and generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three publicly available datasets demonstrate the effectiveness and generality of the proposed method. The code will be made publicly available after the blind-review process is completed.
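The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the kind of fusion-plus-self-attention step it describes: spatial-context tokens and RGB tokens are fused and passed through multi-head self-attention to capture long-range co-occurrence. All names (`CoOccurrenceAttention`), dimensions, the concatenate-then-project fusion, and the mean-pooled classifier head are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn

class CoOccurrenceAttention(nn.Module):
    """Hypothetical sketch: fuse spatial-context and RGB features, then apply
    self-attention to model long-range object co-occurrence (assumed design)."""
    def __init__(self, dim=512, num_heads=8, num_classes=67):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # concat-then-project fusion (assumption)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)  # scene classifier

    def forward(self, spatial_feats, rgb_feats):
        # spatial_feats, rgb_feats: (B, N, dim) token sequences, e.g. N flattened
        # feature-map locations per image (token layout is an assumption).
        x = self.fuse(torch.cat([spatial_feats, rgb_feats], dim=-1))
        x, _ = self.attn(x, x, x)       # long-range co-occurrence context
        return self.head(x.mean(dim=1)) # pool tokens -> scene logits

# Usage: 4 images, 49 tokens each (e.g. a 7x7 feature map), 512-d features.
model = CoOccurrenceAttention()
logits = model(torch.randn(4, 49, 512), torch.randn(4, 49, 512))  # shape (4, 67)
```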