We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT used by previous pre-training models with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design a masked image reconstruction (MIR) task for fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. Code is available at https://github.com/GewelsJI/MVLT.
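To make the MIR idea concrete, the sketch below illustrates the general masked-image-reconstruction recipe under simplifying assumptions: a stand-in patch encoder (`ToyEncoder`), a linear pixel-decoding head (`MIRHead`), and an L1 loss computed on masked patches only. All names here are hypothetical rather than MVLT's actual API, and the joint text branch is omitted for brevity; the linked repository holds the authors' implementation.

```python
# Illustrative-only sketch of a masked image reconstruction (MIR) objective.
# Names (ToyEncoder, MIRHead, mir_loss, mask_ratio) are hypothetical and do
# not mirror MVLT's code; the joint text branch is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


def patchify(imgs, patch_size=16):
    """Flatten (B, C, H, W) images into (B, N, patch_size**2 * C) pixel patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size ** 2 * C)


class ToyEncoder(nn.Module):
    """Stand-in for the end-to-end ViT encoder (vision-only here)."""
    def __init__(self, dim=768, patch_size=16, in_chans=3, depth=2):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, imgs, patch_mask):
        x = self.proj(imgs).flatten(2).transpose(1, 2)            # (B, N, dim)
        # Replace masked patch embeddings with a learnable mask token.
        x = torch.where(patch_mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.blocks(x)


class MIRHead(nn.Module):
    """Linear head that decodes encoder features back to raw pixel patches."""
    def __init__(self, dim=768, patch_size=16, in_chans=3):
        super().__init__()
        self.decoder = nn.Linear(dim, patch_size ** 2 * in_chans)

    def forward(self, feats):
        return self.decoder(feats)                                # (B, N, pixels)


def mir_loss(encoder, head, imgs, mask_ratio=0.25, patch_size=16):
    """Mask a random subset of patches and reconstruct their raw pixels.
    The loss is taken on masked positions only, as in masked modeling."""
    target = patchify(imgs, patch_size)
    B, N, _ = target.shape
    mask = torch.rand(B, N, device=imgs.device) < mask_ratio      # True = masked
    pred = head(encoder(imgs, mask))
    return F.l1_loss(pred[mask], target[mask])


imgs = torch.randn(2, 3, 224, 224)
loss = mir_loss(ToyEncoder(), MIRHead(), imgs)
loss.backward()
```

Reconstructing raw pixels, rather than matching features from a frozen pre-processing model such as ResNet, is what allows the framework to stay end-to-end and accept raw multi-modal inputs directly.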