Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated, and they are often lost or never examined again. This is a missed opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing them is often prohibitive for data scientists. We present the Lifelong Database of Experiments (LDE), which automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle, including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, in particular ranking artifacts by performance metrics. We demonstrate the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. We then perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when aggregated results are used. This suggests that a system that automatically collects and aggregates results, such as the LDE, not only assists in implementing meta-learning but may also improve its performance.
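The querying, aggregation, and ranking described in the abstract can be illustrated with a minimal sketch. It does not reflect the LDE's actual schema or API: the record fields (`dataset`, `pipeline`, `accuracy`), the function `rank_pipelines`, and all values are hypothetical and chosen only to show why ranking on aggregated runs can differ from ranking on a single run.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical run records; field names and scores are made up for illustration.
runs = [
    {"dataset": "d1", "pipeline": "random_forest", "accuracy": 0.90},
    {"dataset": "d1", "pipeline": "random_forest", "accuracy": 0.84},
    {"dataset": "d1", "pipeline": "logistic_regression", "accuracy": 0.89},
    {"dataset": "d1", "pipeline": "logistic_regression", "accuracy": 0.87},
]


def rank_pipelines(records, dataset, metric="accuracy"):
    """Rank pipelines on one dataset by their mean metric over repeated runs."""
    scores = defaultdict(list)
    for record in records:
        if record["dataset"] == dataset:
            scores[record["pipeline"]].append(record[metric])
    summary = [
        {
            "pipeline": name,
            "mean": mean(values),
            "std": pstdev(values),  # spread across repeated runs
            "n_runs": len(values),
        }
        for name, values in scores.items()
    ]
    # Highest mean first.
    return sorted(summary, key=lambda row: row["mean"], reverse=True)


# Ranking on aggregated results versus picking the best single run.
ranking = rank_pipelines(runs, "d1")
best_single_run = max(runs, key=lambda r: r["accuracy"])["pipeline"]
```

In this toy data the best single run belongs to `random_forest`, while the aggregated ranking places `logistic_regression` first, which is the kind of run-to-run variability the abstract argues a recommendation should account for.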