This paper describes recent developments in object-specific pose and shape prediction from single images. The main contribution is a new approach to camera pose prediction based on self-supervised learning of keypoints corresponding to locations on a category-specific deformable shape. We design a network that generates a proxy ground-truth heatmap from a set of keypoints distributed over the category-specific mean shape, where each keypoint is represented by a unique color on a labeled texture. The proxy ground-truth heatmap is used to train a deep keypoint prediction network, which can then be used for online inference. The proposed approach to camera pose prediction shows significant improvements over state-of-the-art methods. We use this camera pose prediction to infer 3D objects from 2D image frames of video sequences online. To train the reconstruction model, it receives only a silhouette mask from a single frame of a video sequence at each training step, together with a category-specific mean object shape. We conducted experiments on three datasets representing the bird category: the CUB [51] image dataset and the YouTubeVos and Davis [56] video datasets. The network is trained on the CUB training set and evaluated on all three datasets; the online experiments are demonstrated on YouTubeVos and Davis video sequences.