Improving Content Retrievability in Search with Controllable Query Generation

An important goal of online platforms is to enable content discovery, i.e. to let users find catalog entities they were not previously familiar with. A prerequisite for discovering an entity, e.g. a book, with a search engine is that the entity is retrievable, i.e. there are queries for which the system will surface that entity in the top results. However, machine-learned search engines have a high retrievability bias: the majority of queries return the same entities. This happens partly because narrow intent queries predominate, where users build the query from the title of an entity they already know, e.g. 'harry potter' in book search. Broad queries, where users want to discover new entities and have a higher tolerance for what they might find, e.g. 'chill lyrical electronica with an atmospheric feeling to it' in music search, are rare in comparison. We focus here on two factors that have a negative impact on the retrievability of entities: (I) the training data used for dense retrieval models and (II) the distribution of narrow and broad intent queries issued in the system. We propose CtrlQGen, a method that generates queries for a chosen underlying intent, either narrow or broad. CtrlQGen can address factor (I) by generating training data for dense retrieval models that is composed of diverse synthetic queries. It can also address factor (II) by suggesting queries with broader intents to users. Our results on datasets from the domains of music, podcasts, and books show that CtrlQGen significantly decreases the retrievability bias of a dense retrieval model. First, using the generated queries as training data for dense models makes 9% of the entities retrievable (they go from zero to non-zero retrievability). Second, suggesting broader queries to users makes 12% of the entities retrievable in the best case.
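To make the notion of retrievability concrete, the following is a minimal sketch under stated assumptions, not the authors' evaluation code. It counts, for each entity, how many queries surface it in the top k results, and reports the share of entities with non-zero retrievability. The Gini coefficient is used here as one common way to summarize retrievability bias; the abstract does not say which bias measure the paper uses, so that choice is an assumption. The function names, the cutoff k, and the toy catalog are all hypothetical.

```python
def retrievability(rankings, catalog, k=10):
    """Count, per catalog entity, the queries whose top-k results contain it.

    rankings: dict mapping a query string to its ranked list of entity ids.
    catalog:  iterable of all entity ids (entities never returned score 0).
    """
    scores = {entity: 0 for entity in catalog}
    for ranked in rankings.values():
        for entity in ranked[:k]:
            if entity in scores:
                scores[entity] += 1
    return scores


def retrievable_fraction(scores):
    """Share of entities surfaced by at least one query."""
    return sum(1 for s in scores.values() if s > 0) / len(scores)


def gini(scores):
    """Gini coefficient of the scores (assumed bias measure).

    0 means every entity is equally retrievable; values near 1 mean a few
    entities absorb almost all of the top-k slots.
    """
    values = sorted(scores.values())
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Toy example (hypothetical data): three queries over a four-entity catalog.
catalog = ["a", "b", "c", "d"]
rankings = {
    "harry potter": ["a", "b"],
    "harry potter book": ["a", "b"],
    "wizard school adventure": ["a", "c"],
}
scores = retrievability(rankings, catalog, k=2)
print(scores)
print(retrievable_fraction(scores))  # 0.75: entity "d" is never surfaced
print(gini(scores))
```

In this toy setting, entity "d" has zero retrievability. Adding a query whose top results include "d", whether as synthetic training data or as a suggested broader query, would move it from zero to non-zero, which is the kind of change the reported 9% and 12% figures describe.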
