This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval. This task, which consists of searching a database for content similar to a video while that video is still playing, e.g. from a live stream, exhibits challenging characteristics. Only the beginning of the video is available as a query, and new frames are continually added as the video plays out. To perform retrieval in this demanding setting, we propose an approach based on a binary encoder that is both predictive and incremental, in order to (1) account for the video content still missing at query time and (2) keep up with repeated, continuously evolving queries throughout the stream. In particular, we present the first hashing framework that infers the unseen future content of a currently playing video. Experiments on FCVID and ActivityNet demonstrate the feasibility of this task. Our approach also yields a significant mAP@20 performance increase over a baseline adapted from the literature for this task, for instance a 7.4% (2.6%) increase at 20% (50%) of elapsed runtime on FCVID using 192-bit codes.
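To make the retrieval setting concrete, the sketch below shows nearest-neighbor search over binary codes with Hamming distance, the standard operation behind hashing-based retrieval such as the mAP@20 evaluation mentioned above. This is an illustrative assumption, not the authors' encoder: the function names and the toy 8-bit codes are invented for the example (the paper uses e.g. 192-bit codes produced by a learned predictive encoder).

```python
# Hypothetical sketch of Hamming-space retrieval over bitcodes.
# Bitcodes are stored as Python ints; XOR + popcount gives the distance.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two bitcodes stored as ints."""
    return bin(a ^ b).count("1")

def retrieve(query_code: int, db_codes: list[int], k: int = 20) -> list[int]:
    """Return indices of the k database codes closest to the query."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:k]

# Toy example with 8-bit codes; index 0 matches exactly, index 1 differs by one bit.
db = [0b10110010, 0b10110011, 0b01001100, 0b11111111]
print(retrieve(0b10110010, db, k=2))  # -> [0, 1]
```

In the mid-stream scenario, `query_code` would be re-encoded as new frames arrive, so the search above is repeated with an incrementally updated query against the same database.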