Structured Multi-Step Reasoning for Entity Matching Using Large Language Model

Entity matching is a fundamental task in data cleaning and data integration. With the rapid adoption of large language models (LLMs), recent studies have explored zero-shot and few-shot prompting to improve entity matching accuracy. However, most existing approaches rely on single-step prompting and offer limited investigation into structured reasoning strategies. In this work, we investigate how to enhance LLM-based entity matching by decomposing the matching process into multiple explicit reasoning stages. We propose a three-step framework that first identifies matched and unmatched tokens between two records, then determines the attributes most influential to the matching decision, and finally predicts whether the records refer to the same real-world entity. In addition, we explore a debate-based strategy that contrasts supporting and opposing arguments to improve decision robustness. We evaluate our approaches against multiple existing baselines on several real-world entity matching benchmark datasets. Experimental results demonstrate that structured multi-step reasoning can improve matching performance in several cases, while also highlighting remaining challenges and opportunities for further refinement of reasoning-guided LLM approaches.
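The three-step framework described above can be pictured as a chain of prompts, each feeding its output into the next. The following is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the `serialize` and `match_records` helpers, and the `llm` callable (any function that takes a prompt string and returns the model's text reply) are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict

Record = Dict[str, str]
# Hypothetical interface: any function mapping a prompt to the model's reply.
LLM = Callable[[str], str]


def serialize(record: Record) -> str:
    """Flatten a record into 'attribute: value' pairs for prompting."""
    return "; ".join(f"{key}: {value}" for key, value in record.items())


def match_records(a: Record, b: Record, llm: LLM) -> bool:
    """Run three chained prompts and return the final match decision."""
    pair = f"Record A: {serialize(a)}\nRecord B: {serialize(b)}"

    # Step 1: identify matched and unmatched tokens between the records.
    tokens = llm(
        f"{pair}\nList the tokens that match and the tokens that do not "
        "match between the two records."
    )

    # Step 2: determine which attributes most influence the decision.
    attributes = llm(
        f"{pair}\nToken analysis: {tokens}\n"
        "Which attributes matter most for deciding whether these records match?"
    )

    # Step 3: predict whether both records denote the same real-world entity.
    verdict = llm(
        f"{pair}\nToken analysis: {tokens}\nKey attributes: {attributes}\n"
        "Do the two records refer to the same real-world entity? "
        "Answer Yes or No."
    )
    # Assumed reply format: the answer begins with "Yes" or "No".
    return verdict.strip().lower().startswith("yes")
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so a stub function can stand in for a real model when testing the control flow.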



