Vision transformers have achieved great success in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular, fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks: the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel vision transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, so that tokens can be merged from different locations and take flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust their shapes to fit semantic concepts and adopt a fine resolution for regions containing critical details, which is beneficial for capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git
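The core idea, merging tokens by clustering their features so that similar regions collapse into a single token while detailed regions keep many, can be sketched generically. The snippet below is a minimal illustration, not TCFormer's actual clustering algorithm: it uses plain k-means with farthest-point initialization, and the function name `merge_tokens` is a hypothetical helper, not part of the released code.

```python
from typing import List


def _dist2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two token feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_tokens(tokens: List[List[float]], k: int, iters: int = 10) -> List[List[float]]:
    """Merge N token feature vectors into k merged tokens.

    Illustrative sketch only: tokens are clustered by feature similarity
    (k-means here, whereas the paper uses its own progressive clustering),
    and each merged token is the mean feature of its cluster, so tokens
    from arbitrary image locations can end up in one cluster.
    """
    # Farthest-point initialization: deterministic and spreads centers out.
    centers = [tokens[0]]
    while len(centers) < k:
        nxt = max(tokens, key=lambda t: min(_dist2(t, c) for c in centers))
        centers.append(nxt)

    for _ in range(iters):
        # Assign each token to its nearest cluster center.
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for t in tokens:
            idx = min(range(k), key=lambda i: _dist2(t, centers[i]))
            groups[idx].append(t)
        # Each merged token is the mean of its cluster's features.
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(vals) / len(g) for vals in zip(*g)]
    return centers
```

For example, eight tokens drawn from two well-separated feature groups merge into two tokens, one per group, regardless of where the original tokens sat in the grid.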