We introduce the first application of the lean methodology to machine learning projects. Similar to lean startups and lean manufacturing, we argue that lean machine learning (LeanML) can drastically reduce avoidable waste in commercial machine learning projects, reduce the business risk of investing in machine learning capabilities and, in so doing, further democratize access to machine learning. The lean design pattern we propose in this paper is based on two realizations. First, it is possible to estimate the best performance one may achieve when predicting an outcome $y \in \mathcal{Y}$ using a given set of explanatory variables $x \in \mathcal{X}$, for a wide range of performance metrics, and without training any predictive model. Second, doing so is considerably easier, faster, and cheaper than learning the best predictive model. We derive formulae expressing the best $R^2$, MSE, classification accuracy, and log-likelihood per observation achievable when using $x$ to predict $y$ as a function of the mutual information $I\left(y; x\right)$, and possibly a measure of the variability of $y$ (e.g., its Shannon entropy in the case of classification accuracy, and its variance in the case of regression MSE). We illustrate the efficacy of the LeanML design pattern on a wide range of regression and classification problems, both synthetic and real-life.
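As a hedged illustration of how a performance ceiling can be read off the mutual information without training a model, the sketch below covers only the special case where $y$ and $x$ are jointly Gaussian. In that case the identity $I\left(y; x\right) = -\frac{1}{2}\log\left(1 - R^2\right)$ is standard, so $R^2 = 1 - e^{-2 I\left(y; x\right)}$ and the corresponding MSE is $\mathrm{Var}(y)\, e^{-2 I\left(y; x\right)}$. This is not claimed to be the general formula derived in the paper. The function names are hypothetical, and mutual information is assumed to be expressed in nats.

```python
import math


def gaussian_mutual_information(rho: float) -> float:
    """Mutual information (nats) between two jointly Gaussian scalars
    with correlation coefficient rho (|rho| < 1)."""
    return -0.5 * math.log(1.0 - rho ** 2)


def r2_from_mutual_information(mi_nats: float) -> float:
    """R^2 implied by a given mutual information.
    Assumption: y and x are jointly Gaussian."""
    return 1.0 - math.exp(-2.0 * mi_nats)


def mse_from_mutual_information(mi_nats: float, var_y: float) -> float:
    """MSE implied by a given mutual information and the variance of y.
    Assumption: y and x are jointly Gaussian."""
    return var_y * math.exp(-2.0 * mi_nats)


if __name__ == "__main__":
    rho = 0.8  # illustrative correlation, not taken from the paper
    mi = gaussian_mutual_information(rho)
    print(f"I(y; x) = {mi:.4f} nats")
    print(f"implied R^2 = {r2_from_mutual_information(mi):.4f}")
    print(f"implied MSE = {mse_from_mutual_information(mi, var_y=4.0):.4f}")
```

In this toy setting the implied $R^2$ recovers $\rho^2$, which is a quick sanity check that the mapping from mutual information to $R^2$ is consistent. In practice the mutual information would have to be estimated from data, and that step is not shown here.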