Finding suitable datasets is the critical first step in data-driven research, yet it remains difficult. Researchers often need to search for datasets starting from a high-level task description. Existing search systems struggle with this because user intent is ambiguous, task-to-dataset mappings and benchmarks are missing, and entity names are ambiguous. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search over unstructured scientific literature. KATS has two key components: offline knowledge base construction and online query processing. The offline pipeline uses a collaborative multi-agent framework for information extraction to automatically construct a high-quality, dynamically updatable task-dataset knowledge graph, thereby filling the task-to-dataset mapping gap. To address entity ambiguity, a semantic-based mechanism performs task entity linking and dataset entity resolution. For online retrieval, KATS uses a specialized hybrid query engine that combines vector search with graph-based ranking to return highly relevant results. We also introduce CS-TDS, a tailored benchmark suite for evaluating task-oriented dataset search systems, which addresses the lack of standardized evaluation. Experiments on this benchmark suite show that KATS significantly outperforms state-of-the-art retrieval-augmented generation frameworks in both effectiveness and efficiency, providing a robust blueprint for the next generation of dataset discovery systems.
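To make the idea of a hybrid query engine concrete, the following is a minimal sketch of how vector search over task entities might be combined with a graph-based ranking of datasets. It is not the KATS implementation: the toy graph, the three-dimensional embeddings, the edge weights (here, hypothetical counts of papers linking a task to a dataset), and the scoring rule (task similarity times edge weight, summed over matched tasks) are all assumptions chosen for illustration.

```python
import math

# Hypothetical task-dataset knowledge graph (an assumption for illustration):
# task -> {dataset: number of papers that use the dataset for that task}.
TASK_DATASET_EDGES = {
    "question answering": {"SQuAD": 3, "Natural Questions": 2},
    "machine translation": {"WMT14": 4},
    "reading comprehension": {"SQuAD": 1, "RACE": 2},
}

# Hypothetical 3-dimensional task embeddings; a real system would obtain
# these from a text embedding model.
TASK_EMBEDDINGS = {
    "question answering": [0.9, 0.1, 0.0],
    "machine translation": [0.0, 0.1, 0.9],
    "reading comprehension": [0.8, 0.3, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, top_k_tasks=2):
    """Rank datasets for a query embedding.

    Step 1 (vector search): keep the top_k_tasks tasks most similar
    to the query.
    Step 2 (graph-based ranking): score each dataset by summing
    similarity * edge weight over the matched tasks it is linked to.
    """
    matched = sorted(
        ((cosine(query_vec, vec), task) for task, vec in TASK_EMBEDDINGS.items()),
        reverse=True,
    )[:top_k_tasks]

    scores = {}
    for sim, task in matched:
        for dataset, weight in TASK_DATASET_EDGES[task].items():
            scores[dataset] = scores.get(dataset, 0.0) + sim * weight

    # Highest score first; ties are broken alphabetically by dataset name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # A query embedding close to "question answering".
    for dataset, score in search([1.0, 0.0, 0.0]):
        print(f"{dataset}: {score:.3f}")
```

In this sketch a dataset linked to several matched tasks accumulates score from each of them, which is one simple way a graph structure can influence ranking beyond raw vector similarity.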