Context, referring to situational factors related to the object of interest, can help infer the object's states or properties in visual recognition. Because such contextual features are too diverse across instances to be annotated, existing attempts simply exploit image labels as supervision to learn them, resulting in various contextual tricks, such as feature pyramids and context attention. However, without carefully modeling the context's properties, especially its relation to the object, the estimated context can be highly inaccurate. To address this problem, we propose a novel Contextual Latent Generative Model (Context-LGM), which considers the object-context relation and models it in a hierarchical manner. Specifically, we first introduce a latent generative model with a pair of correlated latent variables that respectively model the object and the context, and embed their correlation via the generative process. Then, to infer contextual features, we reformulate the objective function of the Variational Auto-Encoder (VAE), so that contextual features are learned as a posterior distribution conditioned on the object. Finally, to implement this contextual posterior, we introduce a Transformer that takes the object's information as a reference and locates correlated contextual factors. The effectiveness of our method is verified by state-of-the-art performance on two context-aware object recognition tasks, i.e., lung cancer prediction and emotion recognition.
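To make the reformulated objective concrete, the following is a minimal sketch of how such a conditional ELBO could be written, assuming a hierarchical factorization with an object latent $z_o$, a context latent $z_c$, a prior $p(z_o)\,p(z_c \mid z_o)$ that embeds the object-context correlation, and a contextual posterior $q(z_c \mid x, z_o)$ conditioned on the object; the exact factorization used in the paper may differ.

\[
\log p(x) \;\ge\; \mathbb{E}_{q(z_o \mid x)\, q(z_c \mid x, z_o)}\big[\log p(x \mid z_o, z_c)\big]
\;-\; \mathbb{E}_{q(z_o \mid x)}\Big[\mathrm{KL}\big(q(z_c \mid x, z_o) \,\|\, p(z_c \mid z_o)\big)\Big]
\;-\; \mathrm{KL}\big(q(z_o \mid x) \,\|\, p(z_o)\big)
\]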
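The contextual posterior can be realized with cross-attention, where the object's embedding acts as the query over scene features. Below is a minimal PyTorch sketch of this idea, not the authors' implementation: the module name, dimensions, and the Gaussian parameterization of $q(z_c \mid x, z_o)$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualPosterior(nn.Module):
    """Hypothetical sketch of a Transformer-based contextual posterior q(z_c | x, z_o).

    The object embedding serves as the query; image patch tokens serve as
    keys/values, so cross-attention highlights contextual factors correlated
    with the object. All names and sizes here are assumptions for illustration.
    """

    def __init__(self, dim=256, latent_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_mu = nn.Linear(dim, latent_dim)      # mean of q(z_c | x, z_o)
        self.to_logvar = nn.Linear(dim, latent_dim)  # log-variance of q(z_c | x, z_o)

    def forward(self, object_emb, image_tokens):
        # object_emb: (B, 1, dim) object reference; image_tokens: (B, N, dim) scene features
        ctx = self.decoder(tgt=object_emb, memory=image_tokens)     # cross-attend to context
        mu, logvar = self.to_mu(ctx), self.to_logvar(ctx)
        z_c = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z_c, mu, logvar


if __name__ == "__main__":
    # Usage: one object embedding attends over 196 patch tokens.
    posterior = ContextualPosterior()
    obj = torch.randn(2, 1, 256)
    tokens = torch.randn(2, 196, 256)
    z_c, mu, logvar = posterior(obj, tokens)
    print(z_c.shape)  # torch.Size([2, 1, 64])
```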