This report describes our solution to the captioning task of the VALUE Challenge 2021. Our solution, named CLIP4Caption++, is built on X-Linear/X-Transformer, an advanced encoder-decoder architecture, and introduces the following improvements: 1) we utilize three strong pre-trained CLIP models to extract text-related appearance features; 2) we adopt the TSN sampling strategy for data enhancement; 3) we incorporate video subtitle information, fused with the visual features as guidance, to provide richer semantic cues; 4) we design word-level and sentence-level ensemble strategies. Our proposed method achieves CIDEr scores of 86.5, 148.4, and 64.5 on the VATEX, YC2C, and TVC datasets, respectively, demonstrating the superior performance of CLIP4Caption++ on all three benchmarks.
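As a rough illustration of points 1) and 2), the sketch below combines TSN-style segment sampling with CLIP image-feature extraction. It is a minimal sketch, not the report's exact pipeline: the helper names (`sample_tsn_indices`, `extract_clip_features`) and the choice of the ViT-B/16 checkpoint are assumptions for illustration only, and frames are assumed to be already preprocessed into CLIP's input space.

```python
# Minimal sketch (assumed, not the authors' exact pipeline): TSN-style frame
# sampling followed by CLIP visual feature extraction.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


def sample_tsn_indices(num_frames: int, num_segments: int = 20) -> list:
    """Split the video into equal segments and draw one random frame per segment."""
    seg_len = num_frames / num_segments
    return [
        int(seg_len * i) + int(torch.randint(0, max(int(seg_len), 1), (1,)).item())
        for i in range(num_segments)
    ]


@torch.no_grad()
def extract_clip_features(frames: torch.Tensor, model_name: str = "ViT-B/16") -> torch.Tensor:
    """frames: (T, 3, H, W) tensor of RGB frames already preprocessed for CLIP."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load(model_name, device=device)          # hypothetical checkpoint choice
    indices = sample_tsn_indices(frames.shape[0])
    sampled = frames[indices].to(device)
    feats = model.encode_image(sampled)                      # (num_segments, feature_dim)
    return feats / feats.norm(dim=-1, keepdim=True)          # L2-normalised appearance features
```

In practice, features extracted this way from several CLIP backbones could be concatenated or fed separately to the encoder-decoder captioning model; the exact fusion used in CLIP4Caption++ is described in the following sections.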