Image-Guided Retrieval with Optional Text (IGROT) unifies visual retrieval (image-only queries) and composed retrieval (image queries with accompanying text). Despite its relevance to applications such as Google Images and Bing, progress has been limited by the lack of an accessible benchmark and of methods that balance performance across both subtasks. Large-scale datasets such as MagicLens are comprehensive but computationally prohibitive to train on, while existing models tend to favor either visual or compositional queries. We introduce FIGROTD, a lightweight yet high-quality IGROT dataset with 16,474 training triplets and 1,262 test triplets spanning composed image retrieval (CIR), sketch-based image retrieval (SBIR), and composite sketch+text-based image retrieval (CSTBIR). To reduce redundancy in the embedding space, we propose the Variance-Guided Feature Mask (VaGFeM), which selectively enhances discriminative feature dimensions based on their variance statistics. We further adopt a dual-loss design combining InfoNCE and triplet objectives to improve compositional reasoning. Trained on FIGROTD, VaGFeM achieves competitive results on nine benchmarks, reaching 34.8 mAP@10 on CIRCO and 75.7 mAP@200 on Sketchy, outperforming stronger baselines despite training on far fewer triplets.
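The abstract does not give VaGFeM's exact formulation, but a minimal sketch of variance-guided feature masking, assuming per-dimension variance is computed over a batch of embeddings, might look like the following. The function name, `keep_ratio`, and the boost factor are illustrative placeholders, not details from the paper.

```python
import torch

def variance_guided_mask(embeddings: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of a variance-guided feature mask.

    Computes per-dimension variance over a batch of embeddings and
    up-weights the highest-variance (most discriminative) dimensions.
    This illustrates the general idea only; it is not the paper's
    stated VaGFeM formulation.
    """
    # Per-dimension variance across the batch: shape (dim,)
    dim_var = embeddings.var(dim=0, unbiased=False)
    # Select the top-k highest-variance dimensions as "discriminative"
    k = max(1, int(keep_ratio * embeddings.shape[1]))
    top_idx = torch.topk(dim_var, k).indices
    # Soft mask: 1 everywhere, boosted on discriminative dimensions
    mask = torch.ones_like(dim_var)
    mask[top_idx] = 2.0  # boost factor is an arbitrary placeholder
    return embeddings * mask  # re-weighted embeddings
```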
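A similarly hedged sketch of the dual-loss design: InfoNCE over in-batch negatives plus a margin-based triplet term using the hardest in-batch negative. The hyperparameters `tau`, `margin`, and `alpha`, and the hard-negative mining strategy, are assumptions for illustration rather than the paper's stated procedure.

```python
import torch
import torch.nn.functional as F

def dual_loss(query: torch.Tensor, target: torch.Tensor,
              tau: float = 0.07, margin: float = 0.2,
              alpha: float = 0.5) -> torch.Tensor:
    """Sketch of a combined InfoNCE + triplet objective.

    `query` and `target` hold embeddings of matched (query_i, target_i)
    pairs; the other in-batch targets serve as negatives. All weights
    and margins here are illustrative guesses.
    """
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)

    # InfoNCE: each query should match its own target among the batch
    logits = q @ t.T / tau                      # (B, B) similarity matrix
    labels = torch.arange(q.shape[0], device=q.device)
    info_nce = F.cross_entropy(logits, labels)

    # Triplet loss with the hardest in-batch negative per query
    sim = q @ t.T
    pos = sim.diag()
    neg = sim.masked_fill(
        torch.eye(len(q), dtype=torch.bool, device=q.device),
        float('-inf')).max(dim=1).values
    triplet = F.relu(margin - pos + neg).mean()

    return alpha * info_nce + (1 - alpha) * triplet
```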