Machine learning is widely applied to physics problems in the scientific literature, where both regression and classification tasks are addressed with a large array of learning algorithms. Unfortunately, the measurement errors of the data used to train machine learning models are almost always neglected. Because the target variables (the quantities one wants to predict) are assumed to be correct, the resulting estimates of model performance, and thus of generalisation power, are too optimistic. In physics this is a serious deficiency, as it can lead to the belief that theories or patterns exist where, in reality, they do not. This paper addresses the deficiency by deriving formulas for commonly used metrics, for both regression and classification problems, that take into account the measurement errors of the target variables. The new formulas yield estimates of the metrics that are always more pessimistic than those obtained with the classical formulas, which ignore measurement errors. The formulas given here are of general validity, completely model-independent, and can be applied without limitations. Thus, with statistical confidence, one can analyze whether relationships exist when dealing with measurements affected by errors of any kind. The formulas have wide applicability outside physics and can be used in all problems where measurement errors are relevant to the conclusions of a study.
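The effect described above can be illustrated with a minimal sketch for one regression metric, the mean absolute error (MAE). This is not the set of formulas derived in the paper; it is a simple worst-case bound that follows from the triangle inequality, under the assumption that each measured target deviates from its true value by at most a known amount. The function names `mae` and `mae_worst_case` and the parameter `max_abs_error` are hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def mae(y_measured: Sequence[float], y_pred: Sequence[float]) -> float:
    """Classical MAE, computed against the measured (possibly wrong) targets."""
    if len(y_measured) != len(y_pred) or len(y_measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(m - p) for m, p in zip(y_measured, y_pred)) / len(y_measured)


def mae_worst_case(
    y_measured: Sequence[float],
    y_pred: Sequence[float],
    max_abs_error: Sequence[float],
) -> float:
    """Upper bound on the MAE against the unknown true targets.

    Assumption (for illustration only): |measured_i - true_i| <= max_abs_error[i].
    By the triangle inequality,
        |true_i - pred_i| <= |measured_i - pred_i| + max_abs_error[i],
    so averaging over i bounds the true MAE from above.
    """
    if len(max_abs_error) != len(y_measured):
        raise ValueError("max_abs_error must have one entry per target")
    if any(e < 0 for e in max_abs_error):
        raise ValueError("error bounds must be non-negative")
    return mae(y_measured, y_pred) + sum(max_abs_error) / len(max_abs_error)


# Toy data, invented for this example.
y_measured = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
max_abs_error = [0.5, 0.5, 0.5]

classical = mae(y_measured, y_pred)
pessimistic = mae_worst_case(y_measured, y_pred, max_abs_error)
print(f"classical MAE:  {classical:.3f}")
print(f"worst-case MAE: {pessimistic:.3f}")
```

The bound is never smaller than the classical value, which matches the qualitative statement that accounting for target errors gives a more pessimistic estimate. A statistical treatment, such as the one the paper refers to, would in general be tighter than this worst-case bound.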