We present a novel classifier network called STEP that classifies perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits gait features to classify the emotional state of the person into one of four emotions: happy, sad, angry, or neutral. We use hundreds of annotated real-world gait videos and augment them with thousands of annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of $2,177$ human gaits annotated with perceived emotions, along with thousands of synthetic gaits. In practice, STEP learns affective features and achieves a classification accuracy of 89% on E-Gait, which is 14-30% more accurate than prior methods.
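To make the classifier side of the pipeline concrete, the sketch below shows a minimal ST-GCN-style network for 4-class emotion classification from skeleton sequences. It is not the authors' released implementation: the joint count (16), channel sizes, temporal kernel size, and the identity adjacency matrix used in the usage example are placeholder assumptions for illustration.

```python
# Minimal ST-GCN-style gait emotion classifier (illustrative sketch only).
# Input tensors have shape (N, C, T, V): batch, coordinate channels,
# frames, and skeleton joints.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal block: a graph (spatial) convolution over joints
    followed by a temporal convolution over frames."""
    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                  # (V, V) normalized adjacency (assumed given)
        self.spatial = nn.Conv2d(in_ch, out_ch, 1)    # 1x1 conv mixes channels per joint
        pad = (t_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, (t_kernel, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                             # x: (N, C, T, V)
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # propagate features along skeleton edges
        return self.temporal(x)

class EmotionSTGCN(nn.Module):
    """Stack of ST-GCN blocks with global pooling and a 4-way emotion head."""
    def __init__(self, A, in_ch=3, num_classes=4):
        super().__init__()
        self.blocks = nn.Sequential(
            STGCNBlock(in_ch, 32, A),
            STGCNBlock(32, 64, A),
        )
        self.head = nn.Linear(64, num_classes)        # happy / sad / angry / neutral

    def forward(self, x):                             # x: (N, 3, T, V) joint coordinates
        x = self.blocks(x)
        x = x.mean(dim=[2, 3])                        # global average over time and joints
        return self.head(x)

# Usage with dummy data: 8 gaits, 3D coordinates, 75 frames, 16 joints.
V = 16
A = torch.eye(V)                                      # identity adjacency as a stand-in
model = EmotionSTGCN(A)
logits = model(torch.randn(8, 3, 75, V))
print(logits.shape)                                   # torch.Size([8, 4])
```

In the full method, the same ST-GCN backbone structure also underlies STEP-Gen's CVAE encoder and decoder, with the push-pull regularization added to the CVAE training loss; that part is omitted here for brevity.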