Embedding-based retrieval has been adopted in a variety of search applications, such as e-commerce and social network search. While the approach has demonstrated its efficacy in tasks like semantic matching and contextual search, it suffers from the problem of uncontrollable relevance. In this paper, we analyze the embedding-based retrieval system launched in early 2021 on our social network search engine and define two main categories of failures it introduces: integrity and junkiness. The former refers to issues such as hate speech and offensive content that can severely harm the user experience, while the latter covers irrelevant results such as fuzzy text matches or language mismatches. We further propose efficient methods applied during model inference to resolve these issues, including indexing treatments and targeted user cohort treatments. Although the methods are simple, we show that they yield gains in both offline NDCG and online A/B test metrics in practice. We analyze the reasons for the improvements and point out that our methods are only preliminary attempts at this important but challenging problem. Finally, we put forward potential future directions to explore.