This document gives a set of recommendations for building and manipulating the datasets used to develop and/or validate machine learning models such as deep neural networks. It is one of the three documents defined in [1] to ensure the quality of datasets. It remains a work in progress, since good practices evolve along with our understanding of machine learning. The document is divided into three main parts. Section 2 addresses the data collection activity. Section 3 gives recommendations about the annotation process. Finally, Section 4 gives recommendations concerning the breakdown of the data into training, validation, and test datasets. In each part, we first define the desired properties at stake, then explain the objectives that must be met to achieve these properties, and finally state the recommendations for reaching these objectives.