Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically reduce the search latency of late interaction, we introduce the Performance-optimized Late Interaction Driver (PLAID). Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly optimized engine to reduce late interaction search latency by up to 7$\times$ on a GPU and 45$\times$ on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and tens or just a few hundred milliseconds on a CPU, even at the largest scale we evaluate, 140M passages.
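To make the two mechanisms concrete, the following is a minimal pure-Python sketch of centroid pruning and centroid interaction as described above. It is an illustration under stated assumptions, not the PLAID engine itself: the function names, the dot-product similarity, the `threshold` and `top_k` parameters, and the toy data are all hypothetical, and the sketch assumes each passage has already been reduced to the list of centroid ids its token embeddings map to.

```python
from typing import Dict, List, Set

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product, used here as the similarity (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def centroid_scores(query_vecs: List[Vector], centroids: List[Vector]) -> List[List[float]]:
    """scores[q][c] = similarity of query token q to centroid c."""
    return [[dot(q, c) for c in centroids] for q in query_vecs]


def prune_centroids(scores: List[List[float]], threshold: float) -> Set[int]:
    """Centroid pruning: keep a centroid only if at least one query token
    scores it at or above the (hypothetical) threshold."""
    n_centroids = len(scores[0])
    return {c for c in range(n_centroids)
            if max(row[c] for row in scores) >= threshold}


def centroid_interaction(scores: List[List[float]], bag: List[int], kept: Set[int]) -> float:
    """Approximate passage score from its bag of centroids: for each query
    token take the best surviving centroid in the bag, then sum over tokens."""
    ids = [c for c in bag if c in kept]
    if not ids:
        return 0.0  # every centroid of this passage was pruned
    return sum(max(row[c] for c in ids) for row in scores)


def filter_passages(query_vecs: List[Vector], centroids: List[Vector],
                    bags: Dict[str, List[int]], threshold: float, top_k: int) -> List[str]:
    """Rank passages by the centroid-only score and keep the top_k ids."""
    scores = centroid_scores(query_vecs, centroids)
    kept = prune_centroids(scores, threshold)
    approx = {pid: centroid_interaction(scores, bag, kept) for pid, bag in bags.items()}
    return sorted(approx, key=approx.get, reverse=True)[:top_k]


# Toy data (hypothetical): three 2-d centroids, two query tokens, three passages.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
bags = {"p1": [0, 1], "p2": [2], "p3": [0]}

survivors = filter_passages(query_vecs, centroids, bags, threshold=0.5, top_k=2)
print(survivors)
```

In this toy run, centroid 2 is pruned because no query token scores it at or above 0.5, so passage `p2` drops out without any of its token embeddings being touched. Only the surviving passages would then need the more expensive token-level late interaction scoring.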