Cross-modal representation learning has become the de facto approach to bridging the semantic gap between text and visual data. Learning modality-agnostic representations in a continuous latent space, however, is often treated as a black-box, data-driven training process. It is well known that the effectiveness of representation learning depends heavily on the quality and scale of training data. For video representation learning, obtaining a complete set of labels that annotates the full spectrum of video content for training is highly difficult, if not impossible. These two issues, black-box training and dataset bias, make representation learning difficult to deploy in practice for video understanding, owing to unexplainable and unpredictable results. In this paper, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. The likelihood training aims to interpret the semantics of embeddings beyond the training labels, while the unlikelihood training leverages prior knowledge as regularization to ensure semantically coherent interpretation. Building on both objectives, a new encoder-decoder network that learns interpretable cross-modal representations is proposed for ad-hoc video search. Extensive experiments on the TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models by a statistically significant margin.
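To make the two objectives concrete, the sketch below shows one common way to pair a likelihood term over annotated concepts with an unlikelihood term over concepts excluded by prior knowledge. This is a minimal PyTorch sketch under our own assumptions, not the paper's exact formulation; the tensor names, masks, and the weighting term `alpha` are hypothetical choices for illustration.

```python
import torch

def likelihood_unlikelihood_loss(concept_logits, pos_mask, neg_mask, alpha=1.0):
    """Illustrative (un)likelihood objective over a concept vocabulary.

    concept_logits: (batch, num_concepts) decoder scores for each concept.
    pos_mask:       (batch, num_concepts) 1 where a concept is annotated as relevant.
    neg_mask:       (batch, num_concepts) 1 where prior knowledge rules the concept out
                    (e.g. it contradicts an annotated concept).
    alpha:          hypothetical weight balancing the two terms.
    """
    probs = torch.sigmoid(concept_logits)
    eps = 1e-8
    # Likelihood: push the probability of annotated concepts towards 1.
    likelihood = -(torch.log(probs + eps) * pos_mask).sum(dim=-1)
    # Unlikelihood: push the probability of excluded concepts towards 0,
    # regularizing the decoded interpretation to stay semantically coherent.
    unlikelihood = -(torch.log(1.0 - probs + eps) * neg_mask).sum(dim=-1)
    return (likelihood + alpha * unlikelihood).mean()
```

In this reading, the likelihood term rewards the decoder for recovering concepts beyond the sparse training labels, while the unlikelihood term uses prior knowledge to suppress interpretations that cannot co-occur with the annotated content.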