We propose Human Pose Models that represent RGB and depth images of human poses independently of clothing textures, backgrounds, lighting conditions, body shapes and camera viewpoints. Learning such universal models requires training images in which all of these factors are varied for every human pose. Capturing such data is prohibitively expensive, so we develop a framework for synthesizing the training data. First, we learn representative human poses from a large corpus of real motion-captured human skeleton data. Next, we fit synthetic 3D human models with different body shapes to each pose and render each from 180 camera viewpoints while randomly varying the clothing textures, backgrounds and lighting. Generative Adversarial Networks are employed to minimize the gap between the synthetic and real image distributions. CNN models are then learned that map human poses to a shared high-level invariant space. The learned CNN models are used as invariant feature extractors on real RGB and depth frames of human action videos, and the temporal variations are modelled by a Fourier Temporal Pyramid. Finally, a linear SVM is used for classification. Experiments on three benchmark cross-view human action datasets show that our algorithm outperforms existing methods by significant margins for both RGB-only and RGB-D action recognition.
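The final recognition stage described above (per-frame CNN features, a Fourier Temporal Pyramid over time, then a linear SVM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pyramid depth (`levels`), the number of retained coefficients (`n_coeffs`), and the variables `per_frame_feats` and `labels` are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fourier_temporal_pyramid(features, levels=3, n_coeffs=4):
    """Build a fixed-length clip descriptor: split the per-frame feature
    sequence into a temporal pyramid (1, 2, 4, ... segments) and keep the
    first few FFT magnitudes of every feature dimension in each segment.
    `levels` and `n_coeffs` are illustrative choices, not the paper's values."""
    T, D = features.shape
    parts = []
    for level in range(levels):
        bounds = np.linspace(0, T, 2 ** level + 1, dtype=int)
        for s in range(2 ** level):
            seg = features[bounds[s]:bounds[s + 1]]
            if seg.shape[0] == 0:                      # guard for very short clips
                seg = np.zeros((1, D))
            spec = np.abs(np.fft.rfft(seg, axis=0))    # per-dimension spectrum
            spec = np.vstack([spec, np.zeros((n_coeffs, D))])[:n_coeffs]  # pad/truncate
            parts.append(spec.ravel())
    return np.concatenate(parts)

# per_frame_feats: list of (T_i, D) arrays of CNN pose features, one per video
# labels: one action class label per video (both assumed to exist upstream)
X = np.stack([fourier_temporal_pyramid(f) for f in per_frame_feats])
clf = LinearSVC(C=1.0).fit(X, labels)
```

Because every video is mapped to a descriptor of fixed length regardless of its duration, the temporal modelling reduces to a standard vector classification problem for the linear SVM.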