Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution

Finding suitable datasets is the critical "first step" in data-driven research, yet it remains a major challenge. Researchers often need to search for datasets based on high-level task descriptions. However, existing search systems struggle with this task because of ambiguous user intent, gaps in task-to-dataset mapping and in benchmarks, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search over unstructured scientific literature. KATS consists of two key components: offline knowledge base construction and online query processing. The offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph, using a collaborative multi-agent framework for information extraction, and thereby fills the task-to-dataset mapping gap. To address entity ambiguity, a semantic-based mechanism performs task entity linking and dataset entity resolution. For online retrieval, KATS uses a specialized hybrid query engine that combines vector search with graph-based ranking to generate highly relevant results. Additionally, we introduce CS-TDS, a benchmark suite tailored to evaluating task-oriented dataset search systems, which addresses the lack of standardized evaluation. Experiments on our benchmark suite show that KATS significantly outperforms state-of-the-art retrieval-augmented generation frameworks in both effectiveness and efficiency, providing a robust blueprint for the next generation of dataset discovery systems.
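The abstract gives no implementation details for the hybrid query engine, so the following is only a minimal sketch of the general idea of combining vector search with graph-based ranking, not the KATS method itself. All names (`TASK_VECTORS`, `TASK_DATASET_EDGES`, `search`), the three-dimensional embeddings, and the edge weights are hypothetical toy values chosen for illustration. The sketch assumes that a query is first matched to task nodes by cosine similarity and that each matched task then passes its similarity to its linked datasets in proportion to edge weight.

```python
import math
from collections import defaultdict

# Hypothetical toy embeddings for task nodes (assumption: real systems
# would use a learned text-embedding model with far more dimensions).
TASK_VECTORS = {
    "question answering": [0.9, 0.1, 0.0],
    "machine translation": [0.1, 0.9, 0.0],
    "conversational question answering": [0.8, 0.2, 0.1],
}

# Hypothetical task -> dataset edges; the weights are made-up counts
# standing in for how often a dataset is linked to a task.
TASK_DATASET_EDGES = {
    "question answering": {"SQuAD": 3, "CoQA": 1},
    "machine translation": {"WMT14": 4},
    "conversational question answering": {"CoQA": 3, "QuAC": 2},
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, top_k_tasks=2):
    """Return (dataset, score) pairs, best first."""
    # Step 1 (vector search): keep the tasks most similar to the query.
    nearest = sorted(
        ((cosine(query_vector, vec), task) for task, vec in TASK_VECTORS.items()),
        reverse=True,
    )[:top_k_tasks]

    # Step 2 (graph-based ranking): each matched task distributes its
    # similarity over its datasets in proportion to edge weight.
    scores = defaultdict(float)
    for similarity, task in nearest:
        edges = TASK_DATASET_EDGES[task]
        total = sum(edges.values())
        for dataset, weight in edges.items():
            scores[dataset] += similarity * weight / total

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = search([1.0, 0.0, 0.0])
for dataset, score in results:
    print(f"{dataset}: {score:.3f}")
```

In this toy setup a dataset linked to several matched tasks accumulates score from each of them, which is one simple way a graph can re-rank the candidates that a pure vector lookup would return.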

