Untrimmed videos on social media, or those captured by robots and surveillance cameras, come in varied aspect ratios. However, 3D CNNs require square-shaped video input whose spatial dimensions are smaller than the original. The random- or center-cropping techniques commonly in use may leave out the video's subject altogether. To address this, we propose an unsupervised video cropping approach, framing the task as a retargeting and video-to-video synthesis problem. The synthesized video maintains a 1:1 aspect ratio, is smaller in size, and stays targeted on the video subject throughout its duration. First, action localization is performed on individual frames by identifying patches with homogeneous motion patterns, and a single salient patch is pinpointed. To avoid viewpoint jitter and flickering artifacts, any inter-frame changes in patch scale or position are applied gradually over time. This is achieved with a poly-Bézier fit in 3D space that passes through chosen pivot timestamps and whose shape is influenced by in-between control timestamps. To corroborate the effectiveness of the proposed method, we evaluate the video classification task, comparing our dynamic cropping against static random cropping on three benchmark datasets: UCF-101, HMDB-51, and ActivityNet v1.3. The clip accuracy and top-1 accuracy for video classification after our cropping outperform 3D CNN performance on same-sized inputs with random crop, sometimes even surpassing larger random crop sizes.
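The per-frame localization step can be illustrated with a minimal sketch. Simple frame differencing stands in here for the motion-pattern analysis described above (the actual method groups patches by homogeneous motion and selects a salient one); the function name and the patch/stride values are hypothetical, not part of the paper.

```python
import numpy as np

def salient_patch(prev_gray, gray, patch=64, stride=32):
    """Pick the patch with the strongest motion as a crude saliency proxy.

    prev_gray, gray: consecutive frames as 2-D float arrays.
    Returns (y, x) of the top-left corner of the most active patch.
    NOTE: frame differencing is an assumption standing in for the
    paper's homogeneous-motion-pattern analysis.
    """
    motion = np.abs(gray - prev_gray)
    h, w = motion.shape
    best, best_yx = -1.0, (0, 0)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = motion[y:y + patch, x:x + patch].mean()
            if score > best:
                best, best_yx = score, (y, x)
    return best_yx
```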
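The temporal-smoothing step can be sketched similarly. Assuming the crop window's state per frame is a 3-vector (center x, center y, size), a poly-Bézier built from chained cubic segments interpolates between pivot timestamps while in-between control points shape each transition. The cubic degree, function names, and example values are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cubic_bezier(p0, c0, c1, p1, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1].

    p0, p1: endpoint states at pivot timestamps; c0, c1: in-between
    control points that shape the curve. Each is an array [x, y, size].
    """
    u = 1.0 - t
    return (u**3) * p0 + 3 * (u**2) * t * c0 + 3 * u * (t**2) * c1 + (t**3) * p1

def smooth_crop_trajectory(pivots, controls, n_frames):
    """Chain cubic segments (a poly-Bezier) through pivot states.

    pivots:   list of (frame_index, state) pairs, state = [x, y, size]
    controls: one (c0, c1) control-point pair per segment
    Returns one interpolated crop state per frame, so scale and
    position change gradually instead of jumping between frames.
    """
    states = np.zeros((n_frames, len(pivots[0][1])))
    for (f0, p0), (f1, p1), (c0, c1) in zip(pivots, pivots[1:], controls):
        for f in range(f0, f1 + 1):
            t = (f - f0) / max(f1 - f0, 1)
            states[f] = cubic_bezier(np.asarray(p0, float), np.asarray(c0, float),
                                     np.asarray(c1, float), np.asarray(p1, float), t)
    return states

# Hypothetical trajectory: pivots at frames 0, 30, 60 with (cx, cy, size).
pivots = [(0, [80, 60, 112]), (30, [120, 70, 112]), (60, [100, 90, 128])]
controls = [([95, 62, 112], [110, 68, 112]),
            ([115, 75, 116], [105, 85, 124])]
traj = smooth_crop_trajectory(pivots, controls, n_frames=61)
```

Because consecutive segments share their pivot endpoints, the resulting trajectory is continuous, which is what suppresses the viewpoint jitter and flickering the abstract describes.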