Weakly supervised video object localization (WSVOL) allows locating objects in videos using only global video tags, such as the object class. State-of-the-art methods rely on multiple independent stages, where initial spatio-temporal proposals are generated using visual and motion cues, and prominent objects are then identified and refined. Localization is achieved by solving an optimization problem over one or more videos, and video tags are typically used only for video clustering. This requires a model per video or per class, making inference costly. Moreover, the localized regions are not necessarily discriminative, either because of unsupervised motion methods like optical flow or because video tags are discarded from the optimization. In this paper, we leverage the successful class activation mapping (CAM) methods designed for weakly supervised object localization (WSOL) with still images. A new Temporal CAM (TCAM) method is introduced to train a discriminative deep learning (DL) model that exploits spatio-temporal information in videos through an aggregation mechanism, called CAM-Temporal Max Pooling (CAM-TMP), over consecutive CAMs. In particular, activations of regions of interest (ROIs) are collected from CAMs produced by a pretrained CNN classifier to build pixel-wise pseudo-labels for training the DL model. In addition, a global unsupervised size constraint and a local constraint, such as a CRF, are used to yield more accurate CAMs. Since inference operates on single independent frames, a clip of frames can be processed in parallel, enabling real-time localization. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos indicate that CAM methods (trained on independent frames) can yield decent localization accuracy. Our proposed TCAM method achieves new state-of-the-art WSVOL accuracy, and visual results suggest that it can be adapted for downstream tasks like visual object tracking and detection. Our code is publicly available.
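To make the two core mechanisms named above concrete, the following is a minimal sketch (not the authors' released code) of CAM-Temporal Max Pooling over consecutive CAMs and of harvesting pixel-wise pseudo-labels from the aggregated map. The window radius, activation thresholds, and ignore index are illustrative assumptions, not values taken from the paper.

```python
import torch


def cam_tmp(cams: torch.Tensor, radius: int = 2) -> torch.Tensor:
    """CAM-Temporal Max Pooling (sketch): aggregate per-frame CAMs by
    element-wise max over a temporal window around each frame.

    cams:   (T, H, W) CAMs from a pretrained CNN classifier, one per frame.
    radius: number of frames on each side of t in the window (assumed value).
    Returns a (T, H, W) tensor of temporally pooled CAMs.
    """
    T = cams.shape[0]
    pooled = torch.empty_like(cams)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        # Keep the strongest activation seen at each pixel across the window.
        pooled[t] = cams[lo:hi].max(dim=0).values
    return pooled


def pseudo_labels(cam: torch.Tensor, fg_thresh: float = 0.7,
                  bg_thresh: float = 0.1, ignore: int = 255) -> torch.Tensor:
    """Build pixel-wise pseudo-labels from one pooled CAM (values in [0, 1]).

    Strongly activated pixels become foreground (1), weakly activated pixels
    become background (0), and ambiguous pixels receive the ignore index so a
    segmentation-style loss can skip them. Thresholds here are assumptions.
    """
    labels = torch.full_like(cam, float(ignore))
    labels[cam >= fg_thresh] = 1.0
    labels[cam <= bg_thresh] = 0.0
    return labels.long()
```

Under this reading, the DL model would be trained with a partial pixel-wise loss over the labeled pixels, complemented by the global size constraint and the local CRF term mentioned above; since the trained model produces a CAM from a single frame, frames of a clip can be scored independently and in parallel at inference time.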