In order to properly train a machine learning model, data must be properly collected. To guarantee a proper data collection, verifying that the collected data set holds certain properties is a possible solution. For example, guaranteeing that the data set contains samples across the whole input space, or that the data set is balanced w.r.t. different classes. We present a formal approach for verifying a set of arbitrarily stated properties over a data set. The proposed approach relies on the transformation of the data set into a first order logic formula, which can be later verified w.r.t. the different properties also stated in the same logic. A prototype tool, which uses the z3 solver, has been developed; the prototype can take as an input a set of properties stated in a formal language and formally verify a given data set w.r.t. to the given set of properties. Preliminary experimental results show the feasibility and performance of the proposed approach, and furthermore the flexibility for expressing properties of interest.
翻译:为了正确培训机器学习模型,必须适当收集数据。为了保证适当的数据收集,核实所收集的数据集具有某些属性,这是一个可能的解决方案。例如,保证数据集包含整个输入空间的样本,或数据集是平衡的。我们提出了一个正式的方法,用以核查一组数据集上一套任意标明的属性。拟议方法依赖于将数据集转换成一个一阶逻辑公式,该逻辑公式可以稍后核实,不同属性也可以在同一逻辑中说明。已经开发了一个原型工具,该原型工具使用z3求解器;原型可以将一组以正式语言表示的属性作为输入,并正式核实给定的数据集 w.r.t。初步实验结果显示拟议方法的可行性和性能,以及表达感兴趣属性的灵活性。