Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve promising results for these two tasks, there is no clear solution for the more general problem of matching textual content and structured data. We introduce a framework that supports this new task in an unsupervised setting for any pair of corpora, whether relational tables or text documents. Our method builds a fine-grained graph over the content of the corpora and derives word embeddings to represent the objects to match in a low-dimensional space. The learned representation enables effective and efficient matching at different granularities, from relational tuples to text sentences and paragraphs. Our flexible framework can exploit pre-trained resources, but it does not depend on their existence, and it achieves higher matching quality when the vocabulary is domain specific. We also introduce optimizations in the graph creation process with an "expand and compress" approach that first identifies new valid relationships across elements, to improve matching, and then prunes nodes and edges, to reduce the graph size. Experiments on real use cases and public datasets show that our framework produces embeddings that outperform word embeddings and fine-tuned language models in both result quality and execution time.
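As a rough illustration of the general idea only, and not the framework's actual implementation, the sketch below builds a small graph that links tuples and sentences to the tokens they contain, and then matches each tuple to its most similar sentence. It makes several simplifying assumptions: the function names (`build_graph`, `walk_profile`, `match`) and the toy data are hypothetical, the "expand and compress" step is omitted, and learned low-dimensional embeddings are replaced by exact short random-walk visit probabilities, a much simpler stand-in.

```python
from collections import defaultdict
import math
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_graph(tuples, sentences):
    """Undirected graph: tuple nodes (t:i) and sentence nodes (s:j) link to token nodes (w:tok)."""
    graph = defaultdict(set)
    for i, row in enumerate(tuples):
        for value in row.values():
            for tok in tokens(str(value)):
                graph[f"t:{i}"].add(f"w:{tok}")
                graph[f"w:{tok}"].add(f"t:{i}")
    for j, sent in enumerate(sentences):
        for tok in tokens(sent):
            graph[f"s:{j}"].add(f"w:{tok}")
            graph[f"w:{tok}"].add(f"s:{j}")
    return graph


def walk_profile(graph, start, steps=3):
    """Visit probabilities of a random walk from `start`, summed over `steps` steps.

    Assumption: this sparse vector stands in for a learned embedding.
    """
    current = {start: 1.0}
    profile = defaultdict(float)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, prob in current.items():
            neighbours = graph[node]
            for nb in neighbours:
                nxt[nb] += prob / len(neighbours)
        current = nxt
        for node, prob in current.items():
            profile[node] += prob
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match(tuples, sentences):
    """Map each tuple index to the index of its most similar sentence."""
    graph = build_graph(tuples, sentences)
    sent_vecs = [walk_profile(graph, f"s:{j}") for j in range(len(sentences))]
    result = {}
    for i in range(len(tuples)):
        tuple_vec = walk_profile(graph, f"t:{i}")
        scores = [cosine(tuple_vec, sv) for sv in sent_vecs]
        result[i] = max(range(len(sentences)), key=scores.__getitem__)
    return result


# Toy data, invented for illustration.
movies = [
    {"title": "Inception", "director": "Christopher Nolan", "year": 2010},
    {"title": "Parasite", "director": "Bong Joon-ho", "year": 2019},
]
reviews = [
    "Bong Joon-ho won awards for Parasite in 2019.",
    "Nolan directed Inception, released in 2010.",
]
print(match(movies, reviews))
```

Because tuples and sentences share one graph, the same similarity function can compare any pair of node types; a fuller version would swap the walk profiles for embeddings trained on sampled walks.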