Human pose estimation (HPE) deals with the ill-posed problem of estimating the 3D positions of human joints directly from images and videos. In the recent literature, most works tackle the problem with convolutional neural networks (CNNs), which achieve state-of-the-art results on most datasets. We show that most neural networks fail to generalize well when the camera undergoes significant viewpoint changes. This behaviour emerges because CNNs cannot model viewpoint equivariance and instead rely on viewpoint invariance, resulting in a high dependency on the training data. Recently, capsule networks (CapsNets) have been proposed in the multi-class classification field as a solution to the viewpoint-equivariance issue, reducing the size and complexity of both the training datasets and the network itself. In this work, we show how capsule networks can be adopted to achieve viewpoint equivariance in human pose estimation. We propose a novel end-to-end viewpoint-equivariant capsule autoencoder that employs matrix capsules with a fast Variational Bayes routing. We achieve state-of-the-art results on multiple tasks and datasets while retaining other desirable properties, such as greater generalization when the viewpoint changes, lower data dependency and fast inference. Additionally, by modelling each joint as a capsule, the hierarchical and geometrical structure of the overall pose is retained in the feature space, independently of the viewpoint. We further evaluate our network on multiple datasets, in both the RGB and depth domains, from seen and unseen viewpoints and on the viewpoint-transfer task.