Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering -- how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders? We conducted semi-structured ethnographic interviews with 18 MLEs working across many applications, including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, we discuss interviewees' pain points and anti-patterns, with implications for tool design.