Accurate and efficient product classification is important for e-commerce applications, as it enables various downstream tasks such as recommendation, retrieval, and pricing. Items often contain textual and visual information, and utilizing both modalities usually outperforms classification that relies on either modality alone. In this paper we describe our methodology and results for the SIGIR eCom Rakuten Data Challenge. We employ a dual attention technique to model image-text relationships using pretrained language and image embeddings. While dual attention has been widely used for Visual Question Answering (VQA) tasks, ours is the first attempt to apply the concept to multimodal product classification.
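For illustration, the sketch below shows one way dual attention over pretrained text and image embeddings could be wired into a classifier: text tokens attend over image regions, image regions attend over text tokens, and the two attended summaries are fused for prediction. This is a minimal sketch under assumptions, not the authors' implementation; all names and dimensions (`DualAttentionClassifier`, `d_txt`, `d_img`, the 47-way output) are hypothetical.

```python
# Minimal, illustrative sketch of dual attention for multimodal classification.
# Not the paper's actual architecture; names and dimensions are assumptions.
import torch
import torch.nn as nn


class DualAttentionClassifier(nn.Module):
    """Text attends over image regions and the image attends over text tokens;
    the two attended summaries are pooled, concatenated, and classified."""

    def __init__(self, d_txt: int, d_img: int, d_model: int, n_classes: int):
        super().__init__()
        self.txt_proj = nn.Linear(d_txt, d_model)   # project pretrained text embeddings
        self.img_proj = nn.Linear(d_img, d_model)   # project pretrained image embeddings
        self.txt_to_img = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, txt_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # txt_emb: (batch, n_tokens, d_txt); img_emb: (batch, n_regions, d_img)
        t = self.txt_proj(txt_emb)
        v = self.img_proj(img_emb)
        # Text queries attend to image regions, and vice versa.
        t_attended, _ = self.txt_to_img(query=t, key=v, value=v)
        v_attended, _ = self.img_to_txt(query=v, key=t, value=t)
        # Mean-pool each attended sequence and concatenate for classification.
        fused = torch.cat([t_attended.mean(dim=1), v_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = DualAttentionClassifier(d_txt=768, d_img=2048, d_model=256, n_classes=47)
    txt = torch.randn(8, 32, 768)    # e.g. token embeddings of product titles
    img = torch.randn(8, 49, 2048)   # e.g. CNN region features of product images
    logits = model(txt, img)         # (8, 47) class scores
    print(logits.shape)
```

The design choice here is the symmetric pair of cross-attention modules: each modality queries the other, so the fused representation captures alignments in both directions rather than conditioning only text on image or vice versa.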