Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively unexplored. We consider the case of a popular visual discovery product, where these representations are trained with multi-task learning, ranging from use-case-specific visual understanding (e.g. skin tone classification) to general representation learning for all visual content (e.g. embeddings for retrieval). In this work, we describe how we (1) generate a dataset of over a billion images via large-scale weakly supervised pretraining to improve the performance of these visual representations, and (2) leverage Transformers to replace the traditional convolutional backbone, with insights into both system and performance improvements, especially at 1B+ image scale. To support this backbone model, we detail a systematic approach to deriving weakly supervised image annotations from heterogeneous text signals, demonstrating the benefits of clustering techniques for handling the long-tail distribution of image labels. Through a comprehensive study of offline and online evaluation, we show that large-scale Transformer-based pretraining provides significant benefits to industry computer vision applications. The model is deployed in a production visual shopping system, with a 36% improvement in top-1 relevance and a 23% improvement in click-through volume. We conduct extensive experiments to better understand the empirical relationships between Transformer-based architectures, dataset scale, and the performance of production vision systems.
