The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-agnostic method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. We incorporate additional regularizers to improve the feature diversity of the synthesized videos as well as the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which to the best of our knowledge has not been previously accomplished.
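To make the optimization loop described above concrete, the following is a minimal, hypothetical PyTorch sketch of the general procedure: a frozen space-time model is primed with a stimulus clip, and a noise-initialized video is iteratively optimized toward a target class under auxiliary regularizers. The function name, loss weights, and the specific total-variation and frame-coherence terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def leaps_synthesize(model, stimulus, target_class, steps=2000, lr=0.05,
                     tv_weight=1e-4, coherence_weight=1e-2):
    """Sketch: synthesize a video by inverting a frozen space-time model.

    `model` maps a video tensor of shape (B, C, T, H, W) to class logits;
    `stimulus` is a reference clip used only to prime the frozen model.
    The regularizers below are hypothetical stand-ins for the paper's
    feature-diversity and temporal-coherence terms.
    """
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)

    # Prime the fixed model with the stimulus video (forward pass only).
    with torch.no_grad():
        _ = model(stimulus)

    # Initialize the synthesized video from random noise.
    video = torch.randn_like(stimulus, requires_grad=True)
    optimizer = torch.optim.Adam([video], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(video)
        cls_loss = F.cross_entropy(logits, target_class)

        # Spatial total-variation term: encourages locally smooth frames.
        tv = (video[..., 1:, :] - video[..., :-1, :]).abs().mean() + \
             (video[..., :, 1:] - video[..., :, :-1]).abs().mean()

        # Cross-frame coherence term: penalizes abrupt changes between frames.
        coherence = (video[:, :, 1:] - video[:, :, :-1]).pow(2).mean()

        loss = cls_loss + tv_weight * tv + coherence_weight * coherence
        loss.backward()
        optimizer.step()

    return video.detach()
```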