This paper aims to learn representations for long sequences of continuous signals. Recently, the BERT model has demonstrated the effectiveness of stacked Transformers for representing sequences of discrete signals (i.e., word tokens). Inspired by its success, we adopt the stacked Transformer architecture, but generalize its training objective to maximize the mutual information between the masked signals and the bidirectional context via a contrastive loss. This enables the model to handle continuous signals, such as visual features. We further consider the case where multiple sequences are semantically aligned at the sequence level but not at the element level (e.g., video and ASR), and propose to use a Transformer to estimate the mutual information between the two sequences, which is again maximized via a contrastive loss. We demonstrate the effectiveness of the learned representations on modeling long video sequences for action anticipation and video captioning. The results show that our method, referred to as the Contrastive Bidirectional Transformer ({\bf CBT}), significantly outperforms various baselines and improves over the state of the art.
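To make the objective concrete, the following is a minimal sketch (not the paper's implementation) of an InfoNCE-style contrastive loss for a masked continuous signal: the model's predicted embedding for a masked position is scored by dot product against a set of candidate features, one of which is the true masked feature and the rest distractors, and the loss is the cross-entropy of identifying the positive. The function name `info_nce_loss` and the plain dot-product score are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(pred, candidates, pos_index):
    """Illustrative contrastive (InfoNCE-style) loss for a masked
    continuous feature.

    pred       : (d,) model prediction for the masked position
    candidates : (n, d) candidate features; row `pos_index` is the
                 true masked feature, the rest are negatives
    pos_index  : index of the positive candidate

    Returns -log p(positive), where p is a softmax over dot-product
    similarities. Minimizing this lower-bounds the mutual information
    between the prediction and the true feature.
    """
    scores = candidates @ pred                    # similarity to each candidate
    scores = scores - scores.max()                # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[pos_index]
```

In practice the negatives are typically other features drawn from the same batch, so no extra sampling machinery is needed.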