The central idea of contrastive learning is to discriminate between different instances while forcing different views of the same instance to share the same representation. Augmentation plays an important role in generating these views and in avoiding trivial solutions, and among augmentations, random cropping has been shown to help the model learn strong, generalizable representations. The commonly used random crop operation keeps the difference between the two views statistically consistent throughout training. In this work, we challenge this convention by showing that adaptively controlling the disparity between the two augmented views during training enhances the quality of the learned representation. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cube from the video using differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective and learns an optimal cropping strategy from the data. Visualizations show that ParamCrop adaptively controls the center distance and the IoU between the two augmented views, and that the learned change in disparity over the course of training is beneficial to learning a strong representation. Extensive ablation studies demonstrate the effectiveness of ParamCrop on multiple contrastive learning frameworks and video backbones. With ParamCrop, we improve the state-of-the-art performance on both the HMDB51 and UCF101 datasets.
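To make the two disparity measures mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes each view is an axis-aligned crop described by a 3x4 affine matrix over normalized (x, y, t) video coordinates in [-1, 1], with a per-axis scale and a center offset; rotation, shear, the differentiable sampling step, and the adversarial training are all omitted. The function names `affine_crop_matrix`, `crop_bounds`, `center_distance`, and `iou_3d` are hypothetical and chosen only for illustration.

```python
import math


def affine_crop_matrix(center, scale):
    """Build a 3x4 affine matrix for an axis-aligned 3D crop (assumed form).

    Maps crop coordinates in [-1, 1]^3 to normalized video coordinates
    (x, y, t): video_coord = scale * crop_coord + center, per axis.
    """
    cx, cy, ct = center
    sx, sy, st = scale
    return [
        [sx, 0.0, 0.0, cx],
        [0.0, sy, 0.0, cy],
        [0.0, 0.0, st, ct],
    ]


def crop_bounds(theta):
    """Return the (low, high) corners of the crop described by theta."""
    low = [theta[i][3] - abs(theta[i][i]) for i in range(3)]
    high = [theta[i][3] + abs(theta[i][i]) for i in range(3)]
    return low, high


def center_distance(theta_a, theta_b):
    """Euclidean distance between the centers of two crops."""
    return math.sqrt(
        sum((theta_a[i][3] - theta_b[i][3]) ** 2 for i in range(3))
    )


def iou_3d(theta_a, theta_b):
    """Intersection over union of two axis-aligned 3D crops."""
    low_a, high_a = crop_bounds(theta_a)
    low_b, high_b = crop_bounds(theta_b)
    intersection = 1.0
    volume_a = 1.0
    volume_b = 1.0
    for i in range(3):
        # Overlap along this axis; clamp to zero when the crops are disjoint.
        overlap = min(high_a[i], high_b[i]) - max(low_a[i], low_b[i])
        intersection *= max(overlap, 0.0)
        volume_a *= high_a[i] - low_a[i]
        volume_b *= high_b[i] - low_b[i]
    union = volume_a + volume_b - intersection
    return intersection / union if union > 0.0 else 0.0


# Two views of the same clip: same size, shifted along x by half the crop width.
view_a = affine_crop_matrix(center=(0.0, 0.0, 0.0), scale=(0.5, 0.5, 0.5))
view_b = affine_crop_matrix(center=(0.5, 0.0, 0.0), scale=(0.5, 0.5, 0.5))

print("center distance:", center_distance(view_a, view_b))
print("IoU:", iou_3d(view_a, view_b))
```

In this parameterization, moving the two centers apart or shrinking the overlap lowers the IoU, which is one way to read the "disparity" between views. In a differentiable setting, a matrix of this shape could be handed to a grid-sampling routine so that gradients reach the crop parameters; that step is left out here.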