Labeling articulated objects in unconstrained settings have a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset consisting image frames of tail-end distribution categories with their corresponding 2D, 3D labels generated from minimal human intervention.
翻译:在不受限制的环境下, 标签显示显示显示精细检测器的多镜头系统有多种广泛的应用, 包括娱乐、 神经科学、 心理学、 ethlogy 和许多医学领域。 大型离线标签标签的数据集并不适用于所有的人, 但最常用的表达式对象类别( 例如人类) 。 在视频序列中标记这些地标是一个艰巨的任务。 具有里程碑标志性的探测器可以提供帮助, 但只要从几个例子中培训, 就可能容易出错。 训练精细度检测器的多镜头系统在检测此类视频错误方面表现出巨大的希望, 允许自我监督的解决方案, 只需要手贴少量的视频序列。 但是, 大型离线标签标签的数据集并不是所有人都存在, 但是, 在现实世界情景中, 手标签标签标签标签标签标签标签标签标签标签标签标签标签标签是一个非常昂贵, 在现实世界情景中, 是不切实际操作的。 本文中, 我们用非硬度的 3D 神经质标志来解决这些瓶颈问题, 只能从两个或三个不精确的图像中获取高密度的标志性标志性标值的估测测测测测测测测测测测值估计, 。 。,,, 我们可以用3O型的模型中, 以3D 以3 的模型中的数据比其他数据序列的模型的模型的模型的模型 。