Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling, as we humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transitions the current slots into queries for the next frame. This architecture is effective, but all existing implementations (\textit{i1}) neglect to incorporate next-frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) we design a new transitioner that incorporates both slots and features, providing more information for query prediction; (\textit{t2}) we train the transitioner to predict queries from slot-feature pairs randomly sampled from the available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. This superiority also benefits downstream tasks such as dynamics modeling. Our core source code, model checkpoints, and training logs are available at https://github.com/Genera1Z/RandSF.Q.
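To make the recurrent aggregator-transitioner loop and the random slot-feature pair sampling concrete, below is a minimal PyTorch sketch. The module definitions, tensor shapes, and the exact sampling rule are illustrative assumptions rather than the actual implementation; see the linked repository for the real RandSF.Q code.

```python
# A minimal sketch (not the authors' code) of the recurrent aggregator/
# transitioner loop and the random slot-feature pair training scheme.
import random
import torch
import torch.nn as nn


class Aggregator(nn.Module):
    """Aggregates a frame's feature tokens into slots under given queries
    (stand-in for a Slot-Attention-style module)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, queries, feats):
        slots, _ = self.attn(queries, feats, feats)
        return slots


class Transitioner(nn.Module):
    """(t1) Predicts next-frame queries from a slot-feature pair, so both
    slots and frame features inform the prediction."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, slots, feats):
        fused, _ = self.attn(slots, feats, feats)  # slots attend over features
        return self.proj(fused)


def rollout(aggregator, transitioner, frames, init_queries, train=True):
    """frames: list of (B, N, D) feature-token tensors, one per video frame."""
    queries, history, outputs = init_queries, [], []
    for t, feats in enumerate(frames):
        slots = aggregator(queries, feats)        # current slots under queries
        outputs.append(slots)
        if t + 1 < len(frames):
            # (i1/t1) pair slots with next-frame features, which are available
            # once the next frame is encoded (assumed pairing rule)
            history.append((slots, frames[t + 1]))
            # (t2) during training, sample the slot-feature pair from a random
            # earlier recurrence, pushing the transitioner to learn transition
            # dynamics rather than copy the latest slots (assumed sampling rule)
            src = random.choice(history) if train else history[-1]
            queries = transitioner(*src)
    return torch.stack(outputs, dim=0)            # (T, B, num_slots, D)


# Usage: 4 frames, batch 2, 16 tokens of width 64, 5 slots.
dim = 64
frames = [torch.randn(2, 16, dim) for _ in range(4)]
queries = torch.randn(2, 5, dim)
slots = rollout(Aggregator(dim), Transitioner(dim), frames, queries)
print(slots.shape)  # torch.Size([4, 2, 5, 64])
```

At inference the sketch falls back to the most recent slot-feature pair, so the randomization only shapes training; the varying temporal offsets seen during training are what force the transitioner to model dynamics rather than memorize a one-step mapping.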