Retrieving clothes worn in social media videos (Instagram, TikTok) is the latest frontier of e-fashion, referred to as "video-to-shop" in the computer vision literature. In this paper we present MovingFashion, the first publicly available dataset to address this challenge. MovingFashion is composed of 14,855 social videos, each associated with e-commerce "shop" images in which the corresponding clothing items are clearly portrayed. In addition, we present a network for retrieving the shop images in this scenario, dubbed SEAM Match-RCNN. The model is trained by image-to-video domain adaptation, which allows it to use video sequences for which only the association with a shop image is given, eliminating the need for millions of annotated bounding boxes. SEAM Match-RCNN builds an embedding in which an attention-based weighted sum of a few frames (10) of a social video is enough to identify the correct product within the first 5 retrieved items, in a gallery of 14K+ shop elements, with an accuracy of 80%. This is the best performance on MovingFashion in an exhaustive comparison against related state-of-the-art approaches and alternative baselines.
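The retrieval step described above (an attention-weighted sum of frame embeddings, followed by a top-k lookup in a shop gallery) can be illustrated with a minimal sketch. This is not the SEAM Match-RCNN implementation: the function names, the toy 2-D embeddings, the hand-picked attention scores, and the use of cosine similarity for ranking are all assumptions made for illustration. In the actual model the embeddings and attention scores would come from learned network components.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frame_embeddings, attention_scores):
    """Attention-weighted sum of per-frame embeddings (one vector per video)."""
    weights = softmax(attention_scores)
    dim = len(frame_embeddings[0])
    return [
        sum(w * frame[d] for w, frame in zip(weights, frame_embeddings))
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity; assumed here as the ranking metric."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, gallery, k=5):
    """Return the ids of the k gallery items most similar to the query."""
    ranked = sorted(gallery, key=lambda pid: cosine(query, gallery[pid]), reverse=True)
    return ranked[:k]


def top_k_accuracy(queries, gallery, k=5):
    """Fraction of (embedding, true_id) queries whose true id is in the top k."""
    hits = sum(1 for emb, true_id in queries if true_id in top_k(emb, gallery, k))
    return hits / len(queries)


# Toy data (hypothetical): three frames of one video, two of which show the shirt.
frames = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
scores = [2.0, 1.0, -3.0]  # hypothetical attention scores; the last frame is down-weighted
video = aggregate_frames(frames, scores)

gallery = {"shirt": [1.0, 0.1], "skirt": [0.0, 1.0], "shoes": [-1.0, 0.0]}
print(top_k(video, gallery, k=2))
print(top_k_accuracy([(video, "shirt")], gallery, k=1))
```

Under this reading, the "accuracy of 80% within the first 5 retrieved items" corresponds to `top_k_accuracy(..., k=5)` evaluated over the test videos, with each video reduced to a single embedding from 10 frames.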