MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines

Comments (from arXiv): 13 pages; added new baselines (MLflow and ModelDB) in Section VII-C; added experience with system deployment in Section VIII; added Table I to clarify the correctness of the prioritized pipeline search in Section VII-E

With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and the trained models evolve over time. In a collaborative environment, the changes and updates caused by pipeline evolution often lead to cumbersome coordination and maintenance work, raising costs and making the pipeline hard to use. Existing solutions, unfortunately, do not address the version evolution problem, especially in a collaborative environment where non-linear version control semantics are necessary to isolate operations made by different user roles. The lack of version control semantics also incurs unnecessary storage consumption and lowers efficiency due to data duplication and repeated data pre-processing, both of which are avoidable. In this paper, we identify two main challenges that arise during the deployment of machine learning pipelines, and address them with a versioning design for MLCask, an end-to-end analytics system. The system supports multiple user roles with the ability to perform Git-like branching and merging operations in the context of machine learning pipelines. We define the metric-driven merge operation and accelerate it by pruning the pipeline search tree using reusable history records and pipeline compatibility information. Further, we design and implement the prioritized pipeline search, which gives preference to pipelines that are likely to yield better performance. The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases. The performance evaluation shows that the proposed merge operation is up to 7.8x faster and uses up to 11.9x less storage space than the baseline method that does not utilize history records.
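The abstract describes the metric-driven merge only at a high level, so the following is a minimal Python sketch of the general idea, not MLCask's actual implementation. It assumes a pipeline is a tuple with one component version per stage, that compatibility is known as a set of incompatible adjacent pairs, and that history records are a dictionary from pipeline to metric score. The names `metric_driven_merge`, `candidates`, `incompatible`, `history`, and `evaluate` are all hypothetical, and the prioritized search order is not modeled here.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Pipeline = Tuple[str, ...]  # one component version per pipeline stage


def metric_driven_merge(
    candidates: List[List[str]],           # per stage: versions from both branches
    incompatible: Set[Tuple[str, str]],    # adjacent pairs known not to work together
    history: Dict[Pipeline, float],        # scores of pipelines that were already run
    evaluate: Callable[[Pipeline], float], # runs a pipeline and returns its metric
) -> Tuple[Optional[Pipeline], int]:
    """Return the best-scoring compatible pipeline and the number of new runs."""
    best: Optional[Pipeline] = None
    best_score = float("-inf")
    runs = 0

    def dfs(prefix: Pipeline) -> None:
        nonlocal best, best_score, runs
        if len(prefix) == len(candidates):
            # Leaf of the search tree: reuse a history record when one exists.
            if prefix not in history:
                history[prefix] = evaluate(prefix)
                runs += 1
            if history[prefix] > best_score:
                best, best_score = prefix, history[prefix]
            return
        for component in candidates[len(prefix)]:
            # Prune the whole subtree if this component cannot follow the previous one.
            if prefix and (prefix[-1], component) in incompatible:
                continue
            dfs(prefix + (component,))

    dfs(())
    return best, runs


# Toy data (hypothetical): two versions per stage, one incompatible pair,
# and one pipeline that was already evaluated before the merge.
candidates = [["data@v1", "data@v2"], ["prep@v1", "prep@v2"], ["model@v1", "model@v2"]]
incompatible = {("data@v2", "prep@v1")}
history = {("data@v1", "prep@v1", "model@v1"): 0}


def evaluate(pipeline: Pipeline) -> int:
    # Stand-in metric: number of components taken at version v2.
    return sum(component.endswith("v2") for component in pipeline)


best, runs = metric_driven_merge(candidates, incompatible, history, evaluate)
print(best, runs)
```

In this toy setup, the incompatible pair removes two of the eight candidate pipelines and the history record saves one further run, which illustrates where the savings described in the abstract could come from.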

