Machine learning (ML) has the potential to revolutionize many domains, but its adoption is often hindered by the gap between what domain experts need and the translation of those needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots that aim to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects and overlook critical data-centric challenges. This limitation is especially problematic in real-world settings, where raw data often contains issues such as missing values, label noise, and domain-specific nuances that require tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel multi-agent reasoning system that pairs a strategic coordinator, responsible for dynamic planning and adaptation, with a specialized worker agent responsible for precise execution. Domain expertise is systematically incorporated through a human-in-the-loop approach to guide the reasoning process. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. To cover the dimensions of this taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture that facilitates the addition of new tools from the research community. Empirically, using real-world healthcare datasets, we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines in handling data-centric challenges. CliMB-DC promises to empower experts across diverse domains (healthcare, finance, social sciences, and more) to actively participate in driving real-world impact using ML.