Large Language Models (LLMs) are increasingly used within data systems to process large datasets with text fields. A broad class of such tasks involves a semantic join: joining two tables based on a natural language predicate that an LLM evaluates on each pair of tuples. Semantic joins generalize tasks such as entity matching and record categorization, as well as more complex text understanding tasks. A naive implementation is expensive because it invokes an LLM for every pair of rows in the cross product. Existing approaches mitigate this cost by first applying embedding-based semantic similarity to filter candidate pairs, deferring to an LLM only when similarity scores are deemed inconclusive. However, these methods yield limited gains in practice, since semantic similarity may not reliably predict the join outcome. We propose Featurized-Decomposition Join (FDJ for short), a novel approach for performing semantic joins that significantly reduces cost while preserving quality. FDJ automatically extracts features and combines them into a logical expression in conjunctive normal form, which we call a featurized decomposition, to effectively prune non-matching pairs. A featurized decomposition extracts key information from text records and performs inexpensive comparisons on the extracted features. We show how to use LLMs to automatically extract reliable features and compose them into logical expressions while providing statistical guarantees on the output, an inherently challenging problem due to dependencies among features. Experiments on real-world datasets show up to a 10x reduction in cost compared with the state of the art while providing the same quality guarantees.
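To make the pruning idea concrete, the following is a minimal sketch of how a conjunctive-normal-form filter over extracted features could gate LLM calls. It is an illustration under stated assumptions, not the FDJ implementation: the feature extractor, the literals (`same_brand`, `same_year`, `year_missing`), the toy rows, and the `mock_llm` predicate are all hypothetical, simple string rules stand in for LLM-based feature extraction, and the statistical guarantees mentioned in the abstract are not modeled.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Features = Dict[str, str]
# A literal is an inexpensive comparison on the extracted features of a pair.
Literal = Callable[[Features, Features], bool]
# CNF: a conjunction (outer list) of disjunctions (inner lists) of literals.
CNF = List[List[Literal]]


def extract_features(row: Row) -> Features:
    """Hypothetical stand-in for per-row LLM feature extraction."""
    tokens = row["text"].lower().split()
    year = next((t for t in tokens if t.isdigit() and len(t) == 4), "")
    brand = tokens[0] if tokens else ""
    return {"brand": brand, "year": year}


# Hypothetical literals; each one only touches extracted features.
def same_brand(a: Features, b: Features) -> bool:
    return a["brand"] == b["brand"]


def same_year(a: Features, b: Features) -> bool:
    return a["year"] == b["year"]


def year_missing(a: Features, b: Features) -> bool:
    return a["year"] == "" or b["year"] == ""


def passes(cnf: CNF, a: Features, b: Features) -> bool:
    """A pair survives only if every clause has at least one true literal."""
    return all(any(lit(a, b) for lit in clause) for clause in cnf)


def cnf_filtered_join(
    left: List[Row],
    right: List[Row],
    cnf: CNF,
    llm_predicate: Callable[[Row, Row], bool],
) -> Tuple[List[Tuple[int, int]], int]:
    """Return matching index pairs and the number of LLM predicate calls."""
    # Features are extracted once per row, not once per pair.
    left_feats = [extract_features(r) for r in left]
    right_feats = [extract_features(r) for r in right]
    matches: List[Tuple[int, int]] = []
    llm_calls = 0
    for (i, a), (j, b) in product(enumerate(left_feats), enumerate(right_feats)):
        if not passes(cnf, a, b):
            continue  # pruned by the cheap filter, no LLM call
        llm_calls += 1
        if llm_predicate(left[i], right[j]):
            matches.append((i, j))
    return matches, llm_calls


def mock_llm(l: Row, r: Row) -> bool:
    """Hypothetical placeholder for the expensive per-pair LLM predicate."""
    return l["text"].split()[0].lower() == r["text"].split()[0].lower()


left = [{"text": "Acme laptop 2021 silver"}, {"text": "Zenith phone 2020"}]
right = [
    {"text": "acme notebook 2021"},
    {"text": "Acme laptop 2019"},
    {"text": "zenith handset"},
]
# (same_brand) AND (same_year OR year_missing)
cnf: CNF = [[same_brand], [same_year, year_missing]]

matches, llm_calls = cnf_filtered_join(left, right, cnf, mock_llm)
print(matches, llm_calls, len(left) * len(right))
```

In this toy run the filter passes only two of the six pairs in the cross product to the placeholder predicate. The disjunction `same_year OR year_missing` shows why clauses rather than single features are useful: a missing extracted value should not by itself rule out a pair.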