Despite the great progress in 3D human pose estimation from videos, it remains an open problem to take full advantage of a redundant 2D pose sequence to learn a representative single-pose representation for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, the fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed Strided Transformer Encoder (STE), which is built upon the outputs of VTE. STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global-and-local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed, applied at the full-sequence and single-target-frame scales to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at \url{https://github.com/Vegetebird/StridedTransformer-Pose3D}.
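The sequence-reduction idea can be illustrated with a minimal numpy sketch (not the authors' code; kernel size, stride, and channel count here are illustrative assumptions): stacking strided temporal convolutions progressively shrinks a length-27 pose sequence down to a single-frame representation, mirroring how STE replaces the FFN's fully-connected layers.

```python
import numpy as np

def strided_conv1d(x, w, stride):
    """Strided 1D convolution over time.
    x: (T, C) sequence; w: (k, C, C) kernel; returns (T // stride, C)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad the time axis
    out_len = x.shape[0] // stride
    out = np.zeros((out_len, x.shape[1]))
    for t in range(out_len):
        window = xp[t * stride : t * stride + k]  # (k, C) local context
        out[t] = np.einsum('kc,kcd->d', window, w)
    return out

rng = np.random.default_rng(0)
T, C = 27, 8                        # 27-frame input, 8 channels (illustrative)
x = rng.standard_normal((T, C))
w = rng.standard_normal((3, C, C)) * 0.1

# Three layers with stride 3 shrink the sequence: 27 -> 9 -> 3 -> 1
for _ in range(3):
    x = strided_conv1d(x, w, stride=3)
print(x.shape)  # (1, 8): a single-vector representation of the whole sequence
```

Each layer aggregates a local temporal window while the stride discards redundancy, so the final vector summarizes the full sequence hierarchically rather than by a single global pooling step.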