Industrial applications of supervised machine learning algorithms often require selecting a validation set from a full dataset. This validation set is then used to evaluate the machine learning model independently of the data used to train it. To select this set, we propose to adopt a "design of experiments" point of view and to rely on statistical criteria. We show that the concept of "support points", based on the Maximum Mean Discrepancy criterion, is particularly relevant. An industrial test case from the company EDF illustrates the practical interest of the methodology.
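To make the selection idea concrete, here is a minimal sketch, not the authors' algorithm. It assumes that the validation points must be rows of the dataset and picks them greedily so that the energy distance between the subset and the full sample stays small. The energy distance is the criterion that support points minimize in the literature, and it can be seen as a particular Maximum Mean Discrepancy. The function names (`pairwise_distances`, `energy_distance`, `greedy_validation_indices`) are hypothetical, and the dense distance matrix limits the sketch to datasets of a few thousand rows.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix of the rows of X (dense, O(n^2 d) memory)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(D, idx):
    """Energy distance between the subset `idx` and the full sample.

    D is the pairwise distance matrix of the full sample. The value is
    non-negative and is zero when both empirical distributions coincide.
    """
    idx = np.asarray(idx)
    cross = D[:, idx].mean()             # mean distance full sample <-> subset
    within = D[np.ix_(idx, idx)].mean()  # mean distance inside the subset
    full = D.mean()                      # mean distance inside the full sample
    return 2.0 * cross - within - full


def greedy_validation_indices(D, m):
    """Greedily pick m row indices that keep the energy distance small.

    Assumption of this sketch: one point is added at a time and earlier
    choices are never revisited, so the result is a heuristic, not a
    global minimizer.
    """
    n = D.shape[0]
    mean_to_all = D.mean(axis=1)      # mean distance from each row to all rows
    available = np.ones(n, dtype=bool)
    sum_to_sel = np.zeros(n)          # sum of distances to the selected rows
    sum_cross = 0.0                   # sum of mean_to_all over selected rows
    sum_within = 0.0                  # sum of distances among selected rows
    selected = []
    for k in range(1, m + 1):
        # Energy distance (up to a constant) if each candidate were added.
        score = (2.0 * (sum_cross + mean_to_all) / k
                 - (sum_within + 2.0 * sum_to_sel) / k ** 2)
        score[~available] = np.inf
        c = int(np.argmin(score))
        selected.append(c)
        available[c] = False
        sum_cross += mean_to_all[c]
        sum_within += 2.0 * sum_to_sel[c]
        sum_to_sel += D[:, c]
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))     # synthetic stand-in for a real dataset
    D = pairwise_distances(X)
    val_idx = greedy_validation_indices(D, 30)
    print("energy distance of the selected subset:", energy_distance(D, val_idx))
```

The remaining rows would then serve for training, and the value returned by `energy_distance` gives a way to compare the selected subset with alternatives such as a random split.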