This technical report summarizes our method for the Video-And-Language Understanding Evaluation (VALUE) challenge (https://value-benchmark.github.io/challenge_2021.html). We propose a CLIP-Enhanced method that incorporates image-text pretrained knowledge into downstream video-text tasks. Combined with several other improved designs, our method outperforms the state of the art by $2.4\%$ in Meta-Ave score ($57.58$ to $60.00$) on the VALUE benchmark.
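The core idea of incorporating image-text pretrained knowledge into video-text tasks can be sketched as follows. This is a minimal illustration, not the authors' implementation: per-frame embeddings from an image-text encoder such as CLIP (random stand-in vectors here) are mean-pooled into a video-level feature and matched against a text embedding by cosine similarity.

```python
import numpy as np

def mean_pool_frames(frame_feats):
    """Pool per-frame embeddings into one L2-normalized video feature.

    frame_feats: array of shape (num_frames, dim), e.g. CLIP image
    embeddings for sampled video frames (stand-in randoms below).
    """
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine(a, b):
    # Cosine similarity between a video feature and a text feature.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))   # 8 frames, 512-d embeddings (hypothetical)
text = rng.normal(size=512)          # stand-in text embedding

video_feat = mean_pool_frames(frames)
score = cosine(video_feat, text)     # retrieval/matching score in [-1, 1]
```

In practice the frame and text embeddings would come from a pretrained image-text model's visual and text towers, and the pooled video feature would feed the downstream video-text task head.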