Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream

From arXiv. Manuscript received April 07, 2022; revised June 11, 2022; accepted July 05, 2022. This article was presented in the International Conference on 2022 and appears as part of the ESWEEK-TCAD special issue.

Detecting actions in videos has been widely applied in on-device applications. Practical on-device videos are always untrimmed and contain both action and background. It is desirable for a model to both recognize the class of an action and localize the temporal position where the action happens. Such a task is called temporal action localization (TAL), and TAL models are typically trained in the cloud, where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to continuously and locally learn from new data, which can directly improve action detection precision while protecting customers' privacy. However, training a TAL model is non-trivial, since it requires a tremendous number of video samples with temporal annotations, and annotating videos frame by frame is exorbitantly time-consuming and expensive. Although weakly-supervised TAL (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, such an approach is also not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. Dividing such a long video stream into multiple video segments requires a lot of human effort, which hinders the application of TAL to realistic on-device learning. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging approach to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream.
