Person-Action Instance Search (INS), one of the challenging problems in video search, aims to retrieve, from a massive collection of video shots, those in which a specific person carries out a specific action. Existing methods mainly consist of two steps. First, two individual INS branches, i.e., person INS and action INS, are run separately to compute initial person and action ranking scores. Second, the two scores are fused directly to generate the final ranking list. However, directly aggregating two individual INS scores cannot guarantee identity consistency between person and action. For example, a shot containing "Pat is standing" and "Ian is sitting on couch" may be erroneously understood as "Pat is sitting on couch" or "Ian is standing". To address this identity inconsistency problem (IIP), we study a spatio-temporal identity verification method. Specifically, in the spatial dimension, we propose an identity consistency verification scheme to optimize the direct fusion score of person INS and action INS. The motivation comes from the observation that face detection results are usually located inside the identity-consistent action bounding boxes. Moreover, in the temporal dimension, to cope with complex filming conditions, we propose an inter-frame detection extension operation that interpolates missing face/action detection results across successive video frames. The proposed method is evaluated on the large-scale TRECVID INS dataset, and the experimental results show that it effectively mitigates the IIP and outperforms the existing second-place results in both the TRECVID 2019 and 2020 INS tasks.
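The two verification ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the product fusion rule, the `min_cover` threshold, the `mismatch_weight` penalty, and the linear interpolation of boxes are all assumptions chosen for illustration. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, with `None` marking a frame where the detector returned nothing.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def face_inside_action(face: Box, action: Box, min_cover: float = 0.9) -> bool:
    """Spatial check: is at least `min_cover` of the face box inside the action box?"""
    fx1, fy1, fx2, fy2 = face
    ax1, ay1, ax2, ay2 = action
    inter_w = max(0.0, min(fx2, ax2) - max(fx1, ax1))
    inter_h = max(0.0, min(fy2, ay2) - max(fy1, ay1))
    face_area = (fx2 - fx1) * (fy2 - fy1)
    return face_area > 0 and (inter_w * inter_h) / face_area >= min_cover


def interpolate_missing(track: List[Optional[Box]]) -> List[Optional[Box]]:
    """Temporal step (assumed linear): fill gaps lying between two detected frames."""
    filled = list(track)
    known = [i for i, box in enumerate(track) if box is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = tuple(
                (1 - t) * a + t * b for a, b in zip(track[left], track[right])
            )
    return filled


def verified_score(
    person_score: float,
    action_score: float,
    faces: List[Optional[Box]],
    actions: List[Optional[Box]],
    mismatch_weight: float = 0.0,  # hypothetical penalty for inconsistent shots
) -> float:
    """Fuse both scores, then down-weight shots with no identity-consistent frame."""
    faces = interpolate_missing(faces)
    actions = interpolate_missing(actions)
    consistent = any(
        f is not None and a is not None and face_inside_action(f, a)
        for f, a in zip(faces, actions)
    )
    fused = person_score * action_score  # assumed fusion rule
    return fused if consistent else fused * mismatch_weight
```

In the "Pat"/"Ian" example, a shot where Pat's face lies inside the box of the standing action keeps its fused score for the query "Pat is standing", while the same face paired with Ian's sitting box is down-weighted. The interpolation step matters when the face and the action are never detected in the same frame: filling the gap between two action detections can recover a frame where both are available for the spatial check.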