We introduce a method to learn semantic visual information in an unsupervised manner, based on the premise that complex events (e.g., lasting minutes) can be decomposed into simpler events (e.g., lasting a few seconds), and that these simple events are shared across several complex events. We split a long video into short frame sequences and extract their latent representations with three-dimensional convolutional neural networks. A clustering method is then used to group these representations, producing a visual codebook (i.e., a long video is represented by the sequence of integers given by its cluster labels). A dense representation is learned by encoding the co-occurrence probability matrix of the codebook entries. We demonstrate how this representation can improve the performance of dense video captioning in a scenario where only visual features are available. With this approach, we replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals with comparable performance. Furthermore, concatenating the visual signal with our descriptor in a vanilla transformer achieves state-of-the-art captioning performance among methods that use only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
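To make the pipeline concrete, the following is a minimal sketch of the codebook and co-occurrence steps described above, not the paper's actual implementation. It assumes precomputed 3D-CNN clip embeddings, and the function names (`build_codebook`, `cooccurrence_matrix`), the number of clusters, and the co-occurrence window size are illustrative choices; the dense representation would then be learned from the resulting matrix (e.g., with a GloVe-style objective), which is omitted here.

```python
# Sketch of the codebook + co-occurrence idea, under assumed hyperparameters.
import numpy as np
from sklearn.cluster import KMeans


def build_codebook(clip_embeddings: np.ndarray, num_clusters: int = 128) -> np.ndarray:
    """Cluster short-clip embeddings; each clip is mapped to an integer codebook entry."""
    kmeans = KMeans(n_clusters=num_clusters, random_state=0, n_init=10)
    return kmeans.fit_predict(clip_embeddings)  # shape: (num_clips,)


def cooccurrence_matrix(label_sequences, num_clusters: int, window: int = 5) -> np.ndarray:
    """Count how often codebook entries co-occur within a temporal window of each video."""
    counts = np.zeros((num_clusters, num_clusters), dtype=np.float64)
    for seq in label_sequences:
        for i, a in enumerate(seq):
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if j != i:
                    counts[a, seq[j]] += 1.0
    return counts


# Toy usage: random "clip embeddings" standing in for 3D-CNN features of two videos.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 512))         # 200 clips, 512-d features
labels = build_codebook(emb, num_clusters=16)
videos = [labels[:120], labels[120:]]     # pretend the clips come from two videos
C = cooccurrence_matrix(videos, num_clusters=16)
print(C.shape)                            # (16, 16)
```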