Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but also an understanding of the specific searches in which that content is surfaced. Identifying which queries expose a given piece of content in the ranking results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. In search of a more lightweight solution, we explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the roles of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25 models. We then propose how this approach can be improved through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries, and we conduct an empirical analysis focusing on various practical aspects of approximate EQI.
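To make the reversed-retrieval idea concrete, the following is a minimal sketch for the dense dual-encoder case, not the authors' implementation. It assumes queries and documents are already embedded in a shared space and scored by dot product. The toy vectors, the function names `exact_exposing_queries` and `approximate_exposing_queries`, and the cutoffs `k` and `m` are all hypothetical. The metric-learning refinement mentioned above is not shown.

```python
import numpy as np


def exact_exposing_queries(doc_vecs, query_vecs, doc_id, k):
    """Brute force: score every query against every document and keep
    the queries whose top-k ranking contains doc_id."""
    scores = query_vecs @ doc_vecs.T              # shape (n_queries, n_docs)
    top_k = np.argsort(-scores, axis=1)[:, :k]    # k best documents per query
    hits = (top_k == doc_id).any(axis=1)
    return set(np.flatnonzero(hits).tolist())


def approximate_exposing_queries(doc_vecs, query_vecs, doc_id, m):
    """Reversed retrieval: use the document embedding as the 'query' and
    return the m logged queries most similar to it."""
    sims = query_vecs @ doc_vecs[doc_id]          # shape (n_queries,)
    return np.argsort(-sims)[:m].tolist()


# Toy data (assumed, for illustration only): 3 documents, 4 logged queries.
doc_vecs = np.eye(3)
query_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

exact = exact_exposing_queries(doc_vecs, query_vecs, doc_id=0, k=1)
approx = approximate_exposing_queries(doc_vecs, query_vecs, doc_id=0, m=2)

# Fraction of true exposing queries recovered by the approximate ranking.
recall = len(exact & set(approx)) / len(exact)
print(sorted(exact), approx, recall)
```

The brute-force function runs every query through the ranker, which is the cost the approximate formulation tries to avoid. The reversed version needs only one nearest-neighbor lookup over query embeddings per document, and in practice that lookup could be served by an approximate nearest-neighbor index rather than a full matrix product.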