This work studies a more realistic setting for weakly-supervised multi-modal instance-level product retrieval over fine-grained product categories. We first contribute the Product1M dataset and define two practical instance-level retrieval tasks that enable evaluation for price comparison and personalized recommendation. For both tasks, accurately pinpointing the product target mentioned in the visual-linguistic data while suppressing the influence of irrelevant content is challenging. To address this, we train a more effective cross-modal pretraining model that adaptively incorporates key concept information from the multi-modal data by using an entity graph whose nodes denote entities and whose edges denote similarity relations between entities. Specifically, we propose a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model for instance-level commodity retrieval. It explicitly injects entity knowledge into the multi-modal networks in both node-based and subgraph-based ways via a self-supervised hybrid-stream transformer, which reduces confusion between different object contents and guides the network to focus on entities with real semantics. Experimental results verify the efficacy and generalizability of EGE-CMP, which outperforms several SOTA cross-modal baselines such as CLIP, UNITER and CAPTURE.
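The entity graph described above (nodes as entities, edges as similarity relations) can be illustrated with a minimal sketch. This is not the EGE-CMP implementation: the abstract does not say how similarity is computed, so the cosine measure, the `threshold` value, the toy embeddings, and the names `cosine` and `build_entity_graph` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.8):
    """Build an undirected entity graph as an adjacency dict.

    Nodes are entity names; an edge links two entities whose embedding
    similarity is at least `threshold` (an assumed, illustrative cutoff).
    """
    graph = {name: {} for name in embeddings}
    names = list(embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


# Hypothetical toy embeddings; real entity features would come from a trained encoder.
toy_embeddings = {
    "lipstick": [1.0, 0.1, 0.0],
    "lip gloss": [0.9, 0.2, 0.0],
    "shampoo": [0.0, 0.1, 1.0],
}
entity_graph = build_entity_graph(toy_embeddings)
```

In this toy case the two lip products are linked and "shampoo" remains an isolated node. A subgraph-based view would then group an entity with its neighbors, though how EGE-CMP consumes such subgraphs is not specified in the abstract.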