Previous methods based on 3D CNNs, ConvLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, they still suffer from high computational cost or poor quality of the generated saliency maps. To address these problems, we design a space-time memory (STM)-based network that extracts useful temporal information for the current frame from its adjacent frames, serving as the temporal branch for VSOD. Furthermore, previous methods considered only single-frame prediction without temporal association, so the model may not exploit temporal information sufficiently. We therefore introduce, for the first time, inter-frame object motion prediction into VSOD. Our model follows the standard encoder-decoder architecture. In the encoding stage, we generate high-level temporal features from the high-level features of the current frame and its adjacent frames, which is more efficient than optical flow-based methods. In the decoding stage, we propose an effective fusion strategy for the spatial and temporal branches: the semantic information in the high-level features guides the fusion of object details in the low-level features, and the spatiotemporal features are obtained step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision commonly used in image salient object detection (ISOD), we design a motion-aware loss that predicts object boundary motion and simultaneously performs multitask learning for VSOD and object motion prediction, which further helps the model extract spatiotemporal features accurately and maintain object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art results on some of them. The proposed model requires no optical flow or other preprocessing, and it reaches a speed of nearly 100 FPS during inference.
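The core STM operation the abstract refers to can be illustrated as an attention read: features of the current frame act as queries, and features of adjacent frames act as keys and values in a memory. The following is a minimal NumPy sketch of such a space-time-memory style read; the function name `stm_read`, the flattened `(C, H*W)` layout, and the scaled dot-product softmax are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def stm_read(query, keys, values):
    """Space-time-memory style read: each pixel of the current frame
    (query) attends over all pixels of the adjacent frames (keys) and
    aggregates their features (values) into a temporal feature map.

    query:  (C, HW)   high-level features of the current frame
    keys:   (C, THW)  key features from T adjacent frames
    values: (C, THW)  value features from T adjacent frames
    returns (C, HW)   temporal features aligned with the current frame
    """
    sim = query.T @ keys / np.sqrt(query.shape[0])  # (HW, THW) similarity
    sim -= sim.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)               # softmax over memory pixels
    return values @ w.T                             # (C, HW) weighted read
```

Because the softmax weights form a convex combination over memory pixels, the read stays in the span of the adjacent-frame features, which is why no explicit optical-flow alignment is needed at this stage.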