Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but also an understanding of the specific searches in which that content is surfaced. The problem of identifying which queries expose a given piece of content in the ranked results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the roles of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25 models. We then propose how this approach can be improved through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, and we conduct an empirical analysis focusing on various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
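To make the idea of reversing the roles of queries and documents concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy dual-encoder setting in which queries and documents are already embedded as vectors and scored by dot product; the function names (`top_k_docs`, `exact_exposing_queries`, `approx_exposing_queries`), the random embeddings, and the cutoffs `k` and `n` are all hypothetical choices made for illustration. Exact EQI runs every query through the ranker and checks whether the document appears in the top `k`, while the approximate variant treats the document as the "query" and retrieves its nearest queries.

```python
import numpy as np


def top_k_docs(query_vec, doc_matrix, k):
    """Return indices of the k documents with the highest dot-product score."""
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k]


def exact_exposing_queries(doc_id, query_matrix, doc_matrix, k):
    """Brute force: rank all documents for every query and keep the queries
    whose top-k contains doc_id. Cost grows with (#queries x #documents)."""
    return [
        q for q, q_vec in enumerate(query_matrix)
        if doc_id in top_k_docs(q_vec, doc_matrix, k)
    ]


def approx_exposing_queries(doc_id, query_matrix, doc_matrix, n):
    """Reversed retrieval: use the document embedding as the search key and
    return the n most similar queries as candidate exposing queries."""
    scores = query_matrix @ doc_matrix[doc_id]
    return np.argsort(-scores)[:n].tolist()


# Toy data (assumption): random embeddings stand in for a trained encoder.
rng = np.random.default_rng(0)
queries = rng.normal(size=(200, 16))
docs = rng.normal(size=(50, 16))

exact = exact_exposing_queries(doc_id=7, query_matrix=queries, doc_matrix=docs, k=5)
candidates = approx_exposing_queries(doc_id=7, query_matrix=queries, doc_matrix=docs, n=20)

# Fraction of true exposing queries recovered by the reversed search.
recall = len(set(exact) & set(candidates)) / max(len(exact), 1)
print(f"{len(exact)} exposing queries, recall of top-20 candidates: {recall:.2f}")
```

A query being close to a document does not guarantee that the document ranks in that query's top `k`, since other documents may score higher, so the reversed search is only an approximation. That gap is plausibly what learning a metric over the embedding space is meant to narrow.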