Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn visual representations. Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We discover that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream benchmarks demonstrate the superior performance of CtP against other video pretraining methods. In addition, CtP-pretrained features are less sensitive to domain gaps than those trained by a supervised action recognition task. When both are trained on Kinetics-400, we are pleasantly surprised to find that the CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on the Something-Something dataset. Code is available online: github.com/microsoft/CtP.
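The sample-generation step described above (cut a patch, let it move and scale along a pre-set trajectory, record its box in every frame) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names `make_ctp_sample` and `resize_nearest`, the linear trajectory, the square patch, and the default sizes are all assumptions made for illustration.

```python
import numpy as np


def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (h, w, C) array via index arithmetic."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]


def make_ctp_sample(video, patch_size=16, end_scale=1.5, rng=None):
    """Paste one moving, scaling patch into a copy of `video` (T, H, W, C).

    Returns the edited clip and per-frame boxes of shape (T, 4), stored as
    (x, y, w, h). The linear motion and scaling are illustrative assumptions,
    not the trajectory used by the authors.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = video.shape

    # Cut a square patch from a random location in the first frame.
    sy = rng.integers(0, H - patch_size + 1)
    sx = rng.integers(0, W - patch_size + 1)
    patch = video[0, sy:sy + patch_size, sx:sx + patch_size].copy()

    # Pre-set trajectory: side length and top-left corner change linearly.
    sizes = np.rint(np.linspace(patch_size, patch_size * end_scale, T)).astype(int)
    sizes = np.maximum(sizes, 1)
    max_size = int(sizes.max())
    # Corners are sampled so that even the largest patch stays inside the frame.
    start = rng.integers(0, [W - max_size + 1, H - max_size + 1])
    end = rng.integers(0, [W - max_size + 1, H - max_size + 1])
    corners = np.rint(np.linspace(start, end, T)).astype(int)

    out = video.copy()
    boxes = np.zeros((T, 4), dtype=np.int64)
    for t in range(T):
        s = sizes[t]
        x, y = corners[t]
        out[t, y:y + s, x:x + s] = resize_nearest(patch, s, s)
        boxes[t] = (x, y, s, s)
    return out, boxes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
    clip, boxes = make_ctp_sample(video, rng=rng)
    print(clip.shape, boxes[0], boxes[-1])
```

In this sketch a model would receive `clip` together with `boxes[0]` and be trained to regress the remaining rows of `boxes`. The multi-patch and invisible-patch variants mentioned in the abstract could plausibly be obtained by calling the paste loop several times and skipping the paste on randomly chosen frames, though the abstract does not specify how the authors do this.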