Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To retrieve relevant videos by means of a natural language query, text-video retrieval methods have received increasing attention over the past few years. Data augmentation techniques were introduced to improve performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color space or geometric transformations on images. However, these techniques are usually applied to raw data, leading to more resource-demanding solutions and requiring that the raw data be shareable, which is not always the case, e.g. due to copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We experimentally evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, achieving considerable improvements over a baseline method and improved state-of-the-art performance, and perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
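To make the core idea more concrete, the sketch below illustrates one possible way to mix semantically similar samples directly in the feature space, in a mixup-style fashion. It is only a minimal illustration, not the authors' exact method: the function name `mix_similar_pairs`, the choice of nearest-neighbour captions as the similarity criterion, and the parameters `alpha` and `k` are assumptions introduced for this example.

```python
import torch


def mix_similar_pairs(video_feats, text_feats, alpha=0.5, k=5):
    """Illustrative feature-space augmentation (assumed, not the paper's exact recipe):
    for each (video, caption) pair, pick a semantically similar pair (here, one of the
    k nearest neighbours by caption cosine similarity) and interpolate both modalities
    with the same mixing coefficient, yielding new video and caption features."""
    # cosine similarities between normalised caption features
    t = torch.nn.functional.normalize(text_feats, dim=-1)
    sim = t @ t.t()
    sim.fill_diagonal_(-float("inf"))  # exclude self-matches

    # randomly pick one of the k most similar captions per sample
    n = text_feats.shape[0]
    topk = sim.topk(k, dim=-1).indices
    choice = topk[torch.arange(n), torch.randint(0, k, (n,))]

    # mixup-style coefficient drawn from a Beta distribution (assumed choice)
    lam = torch.distributions.Beta(alpha, alpha).sample((n, 1))

    # mix both modalities with the same coefficient to keep them aligned
    new_videos = lam * video_feats + (1 - lam) * video_feats[choice]
    new_texts = lam * text_feats + (1 - lam) * text_feats[choice]
    return new_videos, new_texts


# Hypothetical usage with precomputed clip and caption embeddings:
videos = torch.randn(32, 512)
captions = torch.randn(32, 512)
aug_videos, aug_captions = mix_similar_pairs(videos, captions)
```

Because the augmentation operates on precomputed embeddings rather than raw frames or text, it avoids redistributing the original clips and keeps the additional compute cost low.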