Structured representations such as keypoints are widely used in pose transfer, conditional image generation, animation, and 3D reconstruction. However, learning them with supervision requires expensive annotation for each target domain. We propose AutoLink, a self-supervised method that learns to disentangle object structure from appearance using a graph of 2D keypoints linked by straight edges. Both the keypoint locations and their pairwise edge weights are learned, given only a collection of images depicting the same object class. The resulting graph is interpretable; for example, AutoLink recovers the human skeleton topology when applied to images of people. Our key ingredients are i) an encoder that predicts keypoint locations in an input image, ii) a shared graph as a latent variable that links the same pairs of keypoints in every image, iii) an intermediate edge map that combines the latent graph edge weights and keypoint locations in a soft, differentiable manner, and iv) an inpainting objective on randomly masked images. Although simpler than existing self-supervised methods, AutoLink outperforms them on the established keypoint and pose estimation benchmarks and paves the way for structure-conditioned generative models on more diverse datasets.
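To make ingredient iii) concrete, the following is a minimal NumPy sketch of one way an edge map could be rendered from keypoint locations and shared edge weights. The abstract does not specify the rendering, so the Gaussian falloff around each line segment, the maximum over edges, the `[-1, 1]` coordinate convention, and the names `edge_map`, `size`, and `sigma` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def edge_map(keypoints, edge_weights, size=32, sigma=0.05):
    """Render a soft edge map (hypothetical helper, not the authors' code).

    keypoints:    (K, 2) array of (x, y) positions, assumed to lie in [-1, 1].
    edge_weights: (K, K) array of non-negative weights; only entries with
                  i < j are read. These are shared across all images.
    Returns a (size, size) array indexed as [row (y), column (x)].
    """
    keypoints = np.asarray(keypoints, dtype=float)
    edge_weights = np.asarray(edge_weights, dtype=float)

    # Pixel-centre coordinates on a regular grid over [-1, 1] x [-1, 1].
    coords = np.linspace(-1.0, 1.0, size)
    xs, ys = np.meshgrid(coords, coords)
    pixels = np.stack([xs, ys], axis=-1)  # (size, size, 2)

    result = np.zeros((size, size))
    num_keypoints = len(keypoints)
    for i in range(num_keypoints):
        for j in range(i + 1, num_keypoints):
            a, b = keypoints[i], keypoints[j]
            ab = b - a
            # Project every pixel onto the segment a-b, clamped to its ends.
            t = ((pixels - a) @ ab) / max(ab @ ab, 1e-8)
            t = np.clip(t, 0.0, 1.0)
            closest = a + t[..., None] * ab
            dist_sq = np.sum((pixels - closest) ** 2, axis=-1)
            # Assumed Gaussian falloff, scaled by the edge weight.
            heat = edge_weights[i, j] * np.exp(-dist_sq / sigma**2)
            # Assumed combination rule: keep the strongest edge per pixel.
            result = np.maximum(result, heat)
    return result


# Usage: three keypoints, one strong edge (0-1) and one weak edge (1-2).
kp = np.array([[-0.5, 0.0], [0.5, 0.0], [0.0, 0.8]])
w = np.zeros((3, 3))
w[0, 1] = 1.0
w[1, 2] = 0.1
m = edge_map(kp, w, size=33)
```

Each operation here (projection, clamping, exponential, maximum) has a direct counterpart in automatic-differentiation frameworks, so under these assumptions gradients could flow both to the keypoint locations and to the edge weights. An edge with a weight near zero contributes almost nothing to the map, which is how a sparse, skeleton-like graph could emerge from training.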