【泡泡一分钟】视觉关系检测中对语境感知的交互识别（ICCV2017-57）

会员服务 ·

【泡泡一分钟】视觉关系检测中对语境感知的交互识别（ICCV2017-57）

2018 年 7 月 29 日 泡泡机器人SLAM

每天一分钟，带你读遍机器人顶级会议文章

标题：Towards Context-aware Interaction Recognition for Visual Relationship Detection

作者：Bohan Zhuang, Lingqiao Liu, Chunhua Shen and Ian Reid

来源：International Conference on Computer Vision (ICCV 2017)

播音员：糯米

编译：林旭滨周平（62）

欢迎个人转发朋友圈；其他机构或自媒体如需转载，后台留言申请授权

摘要

识别物体间的交互信息是视觉识别领域里一个重要的任务。如果我们定义交互的语境（context of the interaction）是与物体关联的，则现有的方法可以分为两类：（i）基于交互——语境组合训练一个单分类器；（ii）独立于显性语境地识别物体交互。两种方法都有其局限性：前者随着交互——语境组合数量增加而扩展的性能不佳，并且难以适用于未见过的交互语境组合；而后者则由于语境无关的交互分类器设计的困难性，常常造成较差的交互识别效果。

图1. 左边是(i)类方法，右边是(ii)类方法，中间是本文所提框架

为了缓解这些不足，本文提出一种基于语境感知的交互识别框架。我们的方法的关键之处在于显式构造了一个结合了语境及交互信息的交互分类器。语境通过word2vec编码转化至语义空间（semantic space）,并且用于推导交互的分类结果。所提方法和上述类型（ii）一样，对每一个交互都建立一个分类器，不过所建立的分类器通过依赖于语境信息的权重，能实现适应语境的作用。用语义空间的好处是它能很自然地实现零样本（zero-shot）通用化，即意义相似的语境信息（主体——客体对（subject-object pairs））可以被识别为某个交互的语境，即使它们在训练集中并未出现过。

由于我们的模型参数不会随着交互数目的增加而增加，我们的方法随着交互——语境对（interaction-context pairs）数量增加也具有良好的扩展性。因此我们的方法避免了上述两类方法的局限性。我们在已有的数据集上都进行实验，结果表明我们的方法在性能上比已有的交互表示方法都要更胜一筹。

图2. 所提方法数据集实验效果，绿色方框是本方法识别的交互结果，红色方框是groundtruth

Abstract

Recognizing how objects interact with each other is a crucial task in visual recognition. If we define the context of the interaction to be the objects involved, then most current methods can be categorized as either: (i) training a single classifier on the combination of the interaction and its context; or (ii) aiming to recognize the interaction independently of its explicit context. Both methods suffer limitations: the former scales poorly with the number of combinations and fails to generalize to unseen combinations, while the latter often leads to poor interaction recognition performance due to the difficulty of designing a contextindependent interaction classifier.

To mitigate those drawbacks, this paper proposes an alternative, context-aware interaction recognition framework. The key to our method is to explicitly construct an interaction classifier which combines the context, and the interaction. The context is encoded via word2vec into a semantic space, and is used to derive a classification result for the interaction. The proposed method still builds one classifier for one interaction (as per type (ii) above), but the classifier built is adaptive to context via weights which are context dependent. The benefit of using the semantic space is that it naturally leads to zero-shot generalizations in which semantically similar contexts (subject-object pairs) can be recognized as suitable contexts for an interaction, even if they were not observed in the training set.

Our method also scales with the number of interaction-context pairs since our model parameters do not increase with the number of interactions. Thus our method avoids the limitation of both approaches. We demonstrate experimentally that the proposed framework leads to improved performance for all investigated interaction representations and datasets.

如果你对本文感兴趣，想要下载完整文章进行阅读，可以关注【泡泡机器人SLAM】公众号（paopaorobot_slam）。

欢迎来到泡泡论坛，这里有大牛为你解答关于SLAM的任何疑惑。

有想问的问题，或者想刷帖回答问题，泡泡论坛欢迎你！

泡泡网站：www.paopaorobot.org

泡泡论坛：http://paopaorobot.org/forums/