Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours, or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches merely to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can currently process only second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed, and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals, however, has the downside of precluding standard augmentation techniques if done naively. We address this by introducing a small network that applies transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed-vision pipeline we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates; processing such long videos is impossible without using compressed representations.
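To make the pipeline concrete, here is a minimal PyTorch sketch of the idea, not the paper's implementation: the module names (LatentEncoder, LatentVideoNet), layer choices, latent dimension, and shapes are all hypothetical stand-ins. A frozen neural compressor maps raw frames to small latent codes, and a regular video network is then trained directly on those latents rather than on RGB pixels.

```python
# A minimal sketch of the compressed-vision pipeline, under assumed shapes.
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Toy stand-in for a pretrained neural compressor: maps raw frames
    to a small latent code via strided 3D convolutions (16x smaller
    spatially in this sketch)."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=4, stride=(1, 4, 4), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, latent_dim, kernel_size=4, stride=(1, 4, 4), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, time, height, width) -> latent codes
        return self.net(video)

class LatentVideoNet(nn.Module):
    """Toy classifier operating directly on latent codes instead of RGB
    frames; 600 classes as in Kinetics600."""
    def __init__(self, latent_dim: int = 64, num_classes: int = 600):
        super().__init__()
        self.backbone = nn.Conv3d(latent_dim, 128, kernel_size=3, padding=1)
        self.head = nn.Linear(128, num_classes)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        feats = torch.relu(self.backbone(latents))
        pooled = feats.mean(dim=(2, 3, 4))  # global average pool
        return self.head(pooled)

encoder = LatentEncoder().eval()          # the compressor stays frozen
model = LatentVideoNet()                  # only this network is trained

video = torch.randn(2, 3, 16, 128, 128)   # short clip, for illustration
with torch.no_grad():
    latents = encoder(video)              # compress once, store latents
logits = model(latents)                   # train directly on latents
print(latents.shape, logits.shape)
```

Because the latents are far smaller than the raw frames, they can be precomputed once, stored, and streamed during training, which is where the data-transfer, speed, and memory savings come from.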
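The latent-space augmentation can be sketched in the same spirit. AugmentNet below is a hypothetical small network trained so that transforming a latent code mimics applying a pixel-space augmentation (a horizontal flip, for illustration) before encoding; the encoder stub, loss, and training step are assumptions for this sketch, not the paper's method.

```python
# A minimal sketch of learning an augmentation directly in latent space.
import torch
import torch.nn as nn

# Stand-in for the frozen neural compressor from the previous sketch.
encoder = nn.Conv3d(3, 64, kernel_size=4, stride=(1, 4, 4), padding=1).eval()

class AugmentNet(nn.Module):
    """Small network mapping a latent code to the latent code of the
    augmented video, so augmentation never touches raw pixels."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(latent_dim, latent_dim, kernel_size=3, padding=1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.net(latents)

def train_step(aug_net, encoder, video, optimizer):
    """One training step: the regression target is the encoding of the
    pixel-space augmented video (here, a horizontal flip)."""
    with torch.no_grad():
        z = encoder(video)                                # original latent
        z_target = encoder(torch.flip(video, dims=[-1]))  # flipped latent
    loss = nn.functional.mse_loss(aug_net(z), z_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

aug_net = AugmentNet()
opt = torch.optim.Adam(aug_net.parameters(), lr=1e-4)
video = torch.randn(2, 3, 16, 128, 128)
print(train_step(aug_net, encoder, video, opt))
```

Once trained, such a network restores standard augmentation to the compressed pipeline: stored latents are perturbed directly at training time, without ever decoding back to pixels.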