Dependency hell is a well-known pain point in the development of large software projects, and machine learning (ML) code bases are not immune to it. In fact, ML applications suffer from an additional form, namely "data source dependency hell". The term refers to the central role played by data, whose unique quirks often cause unexpected failures of ML models that cannot be explained by code changes. In this paper, we present an automated dependency mapping framework that allows MLOps engineers to monitor the whole dependency map of their models in a fast-paced engineering environment and thus mitigate ahead of time the consequences of any data source change (e.g., by re-training the model, ignoring the data, or setting default data). Our system is based on a unified and generic approach that employs static analysis techniques to identify data sources reliably for any type of dependency across a wide range of source languages and artefacts. The dependency mapping framework is exposed as a REST web API whose only input is the path to the Git repository hosting the code base. The framework is currently used by MLOps engineers at Microsoft, and we expect such dependency map APIs to be adopted more widely by MLOps engineers in the future.
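To give a rough sense of what identifying data sources by static analysis can look like, the following is a minimal sketch and not the framework described above. It assumes Python source only, and it assumes that data sources appear as string literals passed to a small set of reader functions. The set `READER_NAMES` and the function `find_data_sources` are hypothetical names introduced for illustration.

```python
import ast

# Hypothetical set of call names treated as data-source entry points.
READER_NAMES = {"read_csv", "read_parquet", "read_json", "open"}


def find_data_sources(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, call name, literal path) for each reader call.

    Only calls whose first positional argument is a string literal are
    reported; paths built at run time are out of scope for this sketch.
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `pd.read_csv(...)` (Attribute) and `open(...)` (Name).
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name not in READER_NAMES or not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((node.lineno, name, first.value))
    return sorted(found)


# Example input: a small training script held in a string.
example = '''
import pandas as pd
train = pd.read_csv("data/train.csv")
labels = pd.read_parquet("s3://bucket/labels.parquet")
model_input = train.merge(labels)
'''

sources = find_data_sources(example)
```

Because the code is parsed and never executed, a scan of this kind can run over a whole repository without installing the project's dependencies. A production system would additionally need to resolve paths assembled from variables or configuration files and to cover languages other than Python.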