Most current action recognition methods rely heavily on appearance information by taking an RGB sequence of entire image regions as input. While effective at exploiting contextual information around humans, e.g., human appearance and scene category, they are easily fooled by out-of-context action videos where the context does not match the target action. In contrast, pose-based methods, which take only a sequence of human skeletons as input, suffer from inaccurate pose estimation or the inherent ambiguity of human pose. Integrating these two approaches has turned out to be non-trivial: training a model with both appearance and pose ends up with a strong bias towards appearance and does not generalize well to unseen videos. To address this problem, we propose to learn pose-driven feature integration that dynamically combines the appearance and pose streams by observing pose features on the fly. The main idea is to let the pose stream decide how much and which appearance information is used in the integration, based on whether the given pose information is reliable or not. We show that the proposed IntegralAction achieves highly robust performance across in-context and out-of-context action video datasets. The code is available at https://github.com/mks0601/IntegralAction_RELEASE.
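To make the pose-driven integration idea concrete, below is a minimal sketch of one way such a gating fusion could look: the pose stream produces a per-channel sigmoid weight that scales the appearance features before the two streams are summed. The module name, layer sizes, and fusion form are illustrative assumptions, not the released IntegralAction implementation.

```python
import torch
import torch.nn as nn

class PoseDrivenGating(nn.Module):
    """Illustrative sketch of pose-driven feature integration.
    The pose stream emits a gate that decides how much appearance
    information is trusted in the fused representation.
    Names and dimensions are assumptions, not the authors' code."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.Sigmoid(),  # per-channel weight in [0, 1]
        )

    def forward(self, appearance_feat, pose_feat):
        # Gate computed from pose features only: unreliable pose can
        # suppress or admit appearance information channel-wise.
        w = self.gate(pose_feat)                 # (B, feat_dim)
        return pose_feat + w * appearance_feat   # gated fusion


# Toy usage: a batch of 4 clips with 512-dim features per stream.
appearance = torch.randn(4, 512)
pose = torch.randn(4, 512)
fused = PoseDrivenGating()(appearance, pose)
print(fused.shape)  # torch.Size([4, 512])
```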