Implicit neural representations (INRs) have gained increasing attention for representing 3D scenes and images, and have recently been applied to encoding videos (e.g., NeRV, E-NeRV). While achieving promising results, existing INR-based methods are limited to encoding a handful of short videos (e.g., seven 5-second videos in the UVG dataset) with redundant visual content. This has led to model designs that fit individual video frames independently and do not scale efficiently to a large number of diverse videos. This paper focuses on developing neural representations for a more practical setup: encoding long videos and/or large numbers of videos with diverse visual content. We first show that, instead of dividing videos into small subsets and encoding them with separate models, encoding long and diverse videos jointly with a unified model achieves better compression results. Based on this observation, we propose D-NeRV, a novel neural representation framework designed to encode diverse videos by (i) decoupling clip-specific visual content from motion information, (ii) introducing temporal reasoning into the implicit neural network, and (iii) employing task-oriented flow as an intermediate output to reduce spatial redundancies. Our new model substantially outperforms NeRV and traditional video compression techniques on the UCF101 and UVG datasets on the video compression task. Moreover, when used as an efficient data-loader, D-NeRV achieves 3%-10% higher accuracy than NeRV on action recognition tasks on the UCF101 dataset under the same compression ratios.