We present StreamDEQ, a method that infers frame-wise representations on videos with minimal per-frame computation. In contrast to conventional methods, where compute time grows at least linearly with the network depth, we aim to update the representations continuously. For this purpose, we leverage recently emerging implicit layer models, which infer the representation of an image by solving a fixed-point problem. Our main insight is to exploit the slowly changing nature of videos and use the previous frame's representation as the initial condition for each frame. This scheme effectively recycles recent inference computations and greatly reduces the required processing time. Through extensive experimental analysis, we show that StreamDEQ recovers near-optimal representations within a few frames' time and maintains an up-to-date representation throughout the video. Our experiments on video semantic segmentation and video object detection show that StreamDEQ achieves accuracy on par with the baseline (a standard MDEQ) while being more than $3\times$ faster. Code and additional results are available at https://ufukertenli.github.io/streamdeq/.
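To make the warm-start idea concrete, the following is a minimal sketch in generic DEQ notation; the symbols ($f_\theta$, $x_t$, $z_t$, $K$) are our own choices and do not appear in the abstract. An implicit (DEQ) layer defines the representation of frame $x_t$ as the fixed point

$$z_t^{\star} = f_\theta\!\left(z_t^{\star}, x_t\right).$$

Rather than solving this equation from scratch on every frame (e.g., starting from $z^{(0)} = \mathbf{0}$), the scheme described above initializes the solver with the previous frame's solution and runs only a few solver steps:

$$z_t^{(0)} = z_{t-1}^{\star}, \qquad z_t^{(k+1)} = f_\theta\!\left(z_t^{(k)}, x_t\right), \quad k = 0, \dots, K-1.$$

Since consecutive frames change slowly, $z_{t-1}^{\star}$ is already close to $z_t^{\star}$, so a small $K$ suffices. The plain fixed-point iteration shown here is only illustrative; DEQ-style models such as MDEQ typically perform the inner solve with an accelerated root finder (e.g., Broyden's method).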