Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these applications post-deployment is difficult due to the lack of real-time feedback (i.e., labels) for predictions and to silent failures that can occur at any component of the ML pipeline (e.g., data distribution shift or anomalous features). We propose a new type of data management system that offers end-to-end observability, or visibility into complex system behavior, for deployed ML pipelines through assisted (1) detection of, (2) diagnosis of, and (3) reaction to ML-related bugs. We describe new research challenges and suggest preliminary solution ideas in all three aspects. Finally, we introduce an example architecture for a "bolt-on" ML observability system, or one that wraps around existing tools in the stack.