Existing methods for interactive image retrieval have demonstrated the merit of integrating user feedback to improve retrieval results. However, most current systems rely on restricted forms of user feedback, such as binary relevance responses or feedback based on a fixed set of relative attributes, which limits their impact. In this paper, we introduce a new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction. We formulate the task of dialog-based interactive image retrieval as a reinforcement learning problem, and reward the dialog system for improving the rank of the target image during each dialog turn. To avoid the cumbersome and costly process of collecting human-machine conversations as the dialog system learns, we train our system with a user simulator, which is itself trained to describe the differences between target and candidate images. The efficacy of our approach is demonstrated in a footwear retrieval application. Extensive experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines, and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results and a more natural and expressive communication interface.
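The per-turn reward described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`rank_of_target`, `turn_reward`) and the score-based ranking are assumptions, and in practice the reward would be computed over a large candidate set scored by the retrieval model.

```python
# Illustrative sketch of a rank-improvement reward: the agent is rewarded
# when the target image's rank among candidates improves after a dialog turn.
# All names here are hypothetical, not from the paper's implementation.

def rank_of_target(scores, target_idx):
    """Rank of the target image (1 = best) under the current retrieval scores."""
    target_score = scores[target_idx]
    # Every candidate scoring strictly higher than the target pushes its rank down.
    return 1 + sum(1 for s in scores if s > target_score)

def turn_reward(prev_scores, new_scores, target_idx):
    """Positive reward when the target's rank improves during this turn."""
    return (rank_of_target(prev_scores, target_idx)
            - rank_of_target(new_scores, target_idx))

# Example: after one round of natural-language feedback, the target (index 2)
# moves from rank 3 to rank 1, yielding a reward of +2.
prev = [0.9, 0.8, 0.5, 0.1]
new = [0.6, 0.7, 0.95, 0.1]
print(turn_reward(prev, new, target_idx=2))  # → 2
```

Rewarding rank improvement rather than only final success gives the agent a dense learning signal at every dialog turn.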