Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to hold informed discourse with humans in natural language; this is particularly true for systems specialized in visually dense information, such as dialogue, recommendation, and search engines for clothing. To this end, we train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images. The key to successfully training our VQA model is the automatic creation of a visual question-answering dataset of 168 million samples, generated from the item attributes of 207 thousand images using diverse templates. The sample generation employs a strategy that accounts for the difficulty of each question-answer pair in order to emphasize challenging concepts. Contrary to the recent trend of using several datasets to pretrain visual question answering models, we keep the dataset fixed while training various models from scratch, so as to isolate improvements due to model architecture changes. We find that using the same transformer for encoding the question and decoding the answer, as in language models, achieves the highest accuracy, showing that visual language models (VLMs) make the best visual question answering systems for our dataset. The accuracy of the best model surpasses the human expert level, even when answering human-generated questions that are not confined to the template formats. Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language. The training of such domain-expert models, e.g., our fashion VLM, cannot rely solely on large-scale general-purpose datasets collected from the web.
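To make the template-based generation concrete, here is a minimal Python sketch of how question-answer pairs could be filled in from an item's attribute annotations, with sampling weighted toward more difficult attributes. All names (`TEMPLATES`, `DIFFICULTY`, `generate_qa_pairs`), the example templates, and the difficulty scores are illustrative assumptions, not the authors' actual pipeline.

```python
import random

# Hypothetical question templates, each targeting one annotated attribute.
TEMPLATES = [
    ("What color is the {category} in the image?", "color"),
    ("What pattern does the {category} have?", "pattern"),
    ("What is the sleeve length of the {category}?", "sleeve_length"),
]

# Assumed per-attribute difficulty scores (e.g., a baseline model's error
# rate on that attribute); a higher weight means the attribute is sampled
# more often, emphasizing challenging concepts.
DIFFICULTY = {"color": 0.2, "pattern": 0.5, "sleeve_length": 0.9}


def generate_qa_pairs(item, n_samples=5, rng=random):
    """Generate (question, answer) pairs for one annotated item.

    `item` is a dict of ground-truth attributes, e.g.
    {"category": "dress", "color": "red", "pattern": "floral",
     "sleeve_length": "short"}.
    """
    # Keep only templates whose target attribute is annotated for this item.
    usable = [(q, attr) for q, attr in TEMPLATES if attr in item]
    weights = [DIFFICULTY[attr] for _, attr in usable]
    pairs = []
    for _ in range(n_samples):
        # Difficulty-weighted choice of which template to instantiate.
        template, attr = rng.choices(usable, weights=weights, k=1)[0]
        question = template.format(category=item["category"])
        pairs.append((question, item[attr]))
    return pairs


if __name__ == "__main__":
    item = {"category": "dress", "color": "red",
            "pattern": "floral", "sleeve_length": "short"}
    for q, a in generate_qa_pairs(item):
        print(f"Q: {q}\nA: {a}")
```

Applied over the attribute annotations of 207 thousand images with a diverse template bank, a generator of this shape can plausibly scale to the hundreds of millions of samples reported, while the difficulty weighting biases the resulting dataset toward the concepts models find hardest.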