We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin. Moreover, these state-of-the-art results are achieved with a single model on both datasets, without fine-tuning. This multi-domain generalisation is achieved through a careful combination of different video-caption datasets: we show that training on several datasets jointly can improve test results on each of them. Additionally, we examine the intersections between many popular datasets and find that MSRVTT has a significant overlap between its test and train splits; the same holds for ActivityNet.
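The train/test overlap check mentioned above can be sketched in its simplest form as an exact-match comparison of normalised captions across splits. This is an illustrative sketch, not the paper's actual procedure; the caption lists and the function name `caption_overlap` are hypothetical.

```python
# Illustrative sketch (not the paper's method): probe train/test leakage
# in a captioning dataset by counting test captions whose normalised text
# also appears verbatim in the train split.

def caption_overlap(train_captions, test_captions):
    """Return the fraction of test captions also present in train."""
    def normalise(c):
        # Lowercase and collapse whitespace so trivial variants match.
        return " ".join(c.lower().split())

    train_set = {normalise(c) for c in train_captions}
    hits = sum(1 for c in test_captions if normalise(c) in train_set)
    return hits / len(test_captions)

# Hypothetical example data for demonstration only.
train = ["A man is playing guitar.", "a dog runs on the beach"]
test = ["A man is playing guitar.", "Two people cook dinner."]
print(caption_overlap(train, test))  # → 0.5
```

A real leakage analysis would also need fuzzy matching (paraphrased captions) and visual near-duplicate detection, since overlapping videos can carry differently worded captions.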