Learning reliable motion representations between consecutive frames, such as optical flow, has proven to greatly benefit video understanding. However, TV-L1, an effective optical flow solver, is time-consuming and incurs high storage costs for caching the extracted flow. To fill this gap, we propose UF-TSN, a novel end-to-end action recognition approach enhanced with an embedded lightweight unsupervised optical flow estimator. UF-TSN estimates motion cues from adjacent frames in a coarse-to-fine manner, focusing on small displacements at each level by extracting feature pyramids and warping the features of one frame toward the other according to the flow estimated at the previous level. Because action datasets lack motion labels, we constrain the flow prediction with multi-scale photometric consistency and edge-aware smoothness losses. Compared with state-of-the-art unsupervised motion representation learning methods, our model achieves higher accuracy while maintaining efficiency, and is competitive with some supervised or more complicated approaches.
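The unsupervised objectives named above can be illustrated with a minimal NumPy sketch. This is a hypothetical single-scale simplification, not the paper's implementation: it uses a Charbonnier photometric term with nearest-neighbor backward warping (a real estimator would use bilinear sampling and apply the losses at multiple pyramid scales) and a first-order edge-aware smoothness term. All function names here are illustrative assumptions.

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    # Robust penalty commonly used in place of plain L1/L2 for flow losses.
    return np.sqrt(x * x + eps * eps)

def warp(img, flow):
    # Backward-warp a grayscale image (H, W) by a flow field (H, W, 2).
    # Nearest-neighbor sampling for brevity; bilinear is standard in practice.
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xw = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yw = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[yw, xw]

def photometric_loss(frame1, frame2, flow):
    # Brightness constancy: frame2 warped back by the flow should match frame1.
    return charbonnier(frame1 - warp(frame2, flow)).mean()

def smoothness_loss(flow, img):
    # Edge-aware smoothness: penalize flow gradients, downweighted where
    # the image itself has strong gradients (likely motion boundaries).
    ix = np.abs(np.diff(img, axis=1))          # image gradients, x
    iy = np.abs(np.diff(img, axis=0))          # image gradients, y
    fx = np.abs(np.diff(flow, axis=1)).sum(-1) # flow gradients, x
    fy = np.abs(np.diff(flow, axis=0)).sum(-1) # flow gradients, y
    return (fx * np.exp(-ix)).mean() + (fy * np.exp(-iy)).mean()
```

With identical frames and zero flow, the photometric loss reduces to the Charbonnier epsilon and the smoothness loss vanishes, which is the expected behavior of a well-posed unsupervised objective.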