Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation, and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.